Computation and Language 88
☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen
Multimodal large language models (MLLMs) have shown impressive success across
modalities such as image, video, and audio in a variety of understanding and
generation tasks. However, current MLLMs are surprisingly poor at understanding
webpage screenshots and generating their corresponding HTML code. To address
this problem, we propose Web2Code, a benchmark consisting of a new large-scale
webpage-to-code dataset for instruction tuning and an evaluation framework for
the webpage understanding and HTML code translation abilities of MLLMs. For
dataset construction, we leverage pretrained LLMs to enhance existing
webpage-to-code datasets as well as generate a diverse pool of new webpages
rendered into images. Specifically, the inputs are webpage images and
instructions, while the responses are the webpage's HTML code. We further
include diverse natural language QA pairs about the webpage content in the
responses to enable a more comprehensive understanding of the web content. To
evaluate model performance in these tasks, we develop an evaluation framework
for testing MLLMs' abilities in webpage understanding and web-to-code
generation. Extensive experiments show that our proposed dataset is beneficial
not only to our proposed tasks but also in the general visual domain, while
previous datasets result in worse performance. We hope our work will contribute
to the development of general MLLMs suitable for web-based content generation
and task automation. Our data and code will be available at
https://github.com/MBZUAI-LLM/web2code.
comment: Website at https://mbzuai-llm.github.io/webpage2code/
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
Large Language Models (LLMs) equipped with extensive world knowledge and
strong reasoning skills can tackle diverse tasks across domains, often by
posing them as conversation-style instruction-response pairs. In this paper, we
propose LLaRA: Large Language and Robotics Assistant, a framework which
formulates robot action policy as conversations, and provides improved
responses when trained with auxiliary data that complements policy learning.
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity
to process state information as visual-textual prompts and generate optimal
policy decisions in text. To train such action policy VLMs, we first introduce
an automated pipeline to generate diverse high-quality robotics instruction
data from existing behavior cloning data. A VLM finetuned with the resulting
collection of datasets based on a conversation-style formulation tailored for
robotics tasks, can generate meaningful robot action policy decisions. Our
experiments across multiple simulated and real-world environments demonstrate
the state-of-the-art performance of the proposed LLaRA framework. The code,
datasets, and pretrained models are available at
https://github.com/LostXine/LLaRA.
☆ Scaling Synthetic Data Creation with 1,000,000,000 Personas
We propose a novel persona-driven data synthesis methodology that leverages
various perspectives within a large language model (LLM) to create diverse
synthetic data. To fully exploit this methodology at scale, we introduce
Persona Hub -- a collection of 1 billion diverse personas automatically curated
from web data. These 1 billion personas (~13% of the world's total population),
acting as distributed carriers of world knowledge, can tap into almost every
perspective encapsulated within the LLM, thereby facilitating the creation of
diverse synthetic data at scale for various scenarios. By showcasing Persona
Hub's use cases in synthesizing high-quality mathematical and logical reasoning
problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs
and tools (functions) at scale, we demonstrate that persona-driven data synthesis is
versatile, scalable, flexible, and easy to use, potentially driving a paradigm
shift in synthetic data creation and applications in practice, which may have a
profound impact on LLM research and development.
comment: Work in progress
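To make the persona-driven recipe concrete, here is a minimal sketch of the prompting pattern under stated assumptions: the `call_llm` helper and the example personas are hypothetical stand-ins, not items from Persona Hub.

```python
# Minimal sketch of persona-driven synthesis. `call_llm` is a hypothetical
# stand-in for an actual LLM client; the personas and task are toy examples.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

personas = [
    "a pediatric nurse in a rural clinic",
    "a high-school math teacher who loves puzzles",
    "a freight logistics coordinator",
]

def synthesize_math_problem(persona: str) -> str:
    # The persona steers the model toward a distinct perspective, so the same
    # instruction yields diverse synthetic examples at scale.
    prompt = (
        f"You are {persona}. Write a challenging math word problem grounded "
        "in your daily work, followed by a step-by-step solution."
    )
    return call_llm(prompt)

# synthetic_data = [synthesize_math_problem(p) for p in personas]
```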
☆ ProgressGym: Alignment with a Millennium of Moral Progress
Frontier AI systems, including large language models (LLMs), hold increasing
influence over the epistemology of human users. Such influence can reinforce
prevailing societal values, potentially contributing to the lock-in of
misguided moral beliefs and, consequently, the perpetuation of problematic
moral practices on a broad scale. We introduce progress alignment as a
technical solution to mitigate this imminent risk. Progress alignment
algorithms learn to emulate the mechanics of human moral progress, thereby
addressing the susceptibility of existing alignment methods to contemporary
moral blindspots. To empower research in progress alignment, we introduce
ProgressGym, an experimental framework allowing the learning of moral progress
mechanics from history, in order to facilitate future progress in real-world
moral decisions. Leveraging 9 centuries of historical text and 18 historical
LLMs, ProgressGym enables codification of real-world progress alignment
challenges into concrete benchmarks. Specifically, we introduce three core
challenges: tracking evolving values (PG-Follow), preemptively anticipating
moral progress (PG-Predict), and regulating the feedback loop between human and
AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension
are inapplicable to these tasks. In response, we present lifelong and
extrapolative algorithms as baseline methods of progress alignment, and build
an open leaderboard soliciting novel algorithms and challenges. The framework
and the leaderboard are available at
https://github.com/PKU-Alignment/ProgressGym and
https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard
respectively.
☆ Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
LLMs process text as sequences of tokens that roughly correspond to words,
where less common words are represented by multiple tokens. However, individual
tokens are often semantically unrelated to the meanings of the words/concepts
they compose. For example, Llama-2-7b's tokenizer splits the word
"northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which
correspond to semantically meaningful units like "north" or "east." Similarly,
the overall meanings of named entities like "Neil Young" and multi-word
expressions like "break a leg" cannot be directly inferred from their
constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups
of tokens into useful higher-level representations? In this work, we find that
last token representations of named entities and multi-token words exhibit a
pronounced "erasure" effect, where information about previous and current
tokens is rapidly forgotten in early layers. Using this observation, we propose
a method to "read out" the implicit vocabulary of an autoregressive LLM by
examining differences in token representations across layers, and present
results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is
the first attempt to probe the implicit vocabulary of an LLM.
comment: 13 pages, 14 figures. Code and data at
https://footprints.baulab.info/
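As a rough illustration of the kind of layer-wise probe involved, the sketch below tracks how the last-token representation of a multi-token word changes across layers; the model choice and the cosine-similarity measure are illustrative assumptions, not the authors' exact read-out procedure.

```python
# Minimal sketch of a layer-wise "erasure" probe: follow the last token of a
# multi-token word through the layer stack and look for early-layer rewriting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tok("I flew to northeastern", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states  # (n_layers + 1) x [1, seq, dim]

last = torch.stack([h[0, -1] for h in hidden])  # last-token state per layer
sims = torch.nn.functional.cosine_similarity(last[:-1], last[1:], dim=-1)
for layer, s in enumerate(sims.tolist()):
    print(f"layer {layer} -> {layer + 1}: cos = {s:.3f}")  # dips suggest erasure
```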
☆ Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Automatic factuality verification of large language model (LLM) generations
is becoming more and more widely used to combat hallucinations. A major point
of tension in the literature is the granularity of this fact-checking: larger
chunks of text are hard to fact-check, but more atomic facts like propositions
may lack context to interpret correctly. In this work, we assess the role of
context in these atomic facts. We argue that fully atomic facts are not the
right representation, and define two criteria for molecular facts:
decontextuality, or how well they can stand alone, and minimality, or how
little extra information is added to achieve decontextuality. We quantify the
impact of decontextualization on minimality, then present a baseline
methodology for generating molecular facts automatically, aiming to add the
right amount of information. We compare against various methods of
decontextualization and find that molecular facts balance minimality with fact
verification accuracy in ambiguous settings.
☆ Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant
potential across various domains, including mitigating harm in LLM outputs,
enhancing text summarization, and mathematical reasoning. This paper introduces
an RLAIF framework for improving the code generation abilities of lightweight
(<1B parameters) LLMs. We specifically focus on code generation tasks that
require writing appropriate API calls, which is challenging due to the
well-known issue of hallucination in LLMs. Our framework extracts AI feedback
from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and
uses this data to train a reward model that better aligns smaller
LLMs. We run our experiments on the Gorilla dataset and meticulously assess the
quality of the model-generated code across various metrics, including AST,
ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate
accurately. Our approach significantly enhances the fine-tuned LLM baseline's
performance, achieving a 4.5% improvement in executability rate. Notably, a
smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger
fine-tuned baseline with 7B parameters, achieving a 1.0% higher code
executability rate.
☆ To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
Polysemy and synonymy are two crucial interrelated facets of lexical
ambiguity. While both phenomena have been studied extensively in NLP, leading
to dedicated systems, they have often been considered independently. While many
tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction)
highlight the role of a word's senses, the study of synonymy is rooted in the
study of concepts, i.e. meaning shared across the lexicon. In this paper, we
introduce Concept Induction, the unsupervised task of learning a soft
clustering among words that defines a set of concepts directly from data. This
task generalizes that of Word Sense Induction. We propose a bi-level approach
to Concept Induction that leverages both a local lemma-centric view and a
global cross-lexicon perspective to induce concepts. We evaluate the obtained
clustering on SemCor's annotated data and obtain good performance (BCubed F1
above 0.60). We find that the local and the global levels are mutually
beneficial to induce concepts and also senses in our setting. Finally, we
create static embeddings representing our induced concepts and use them on the
Word-in-Context task, obtaining performance competitive with the
state of the art.
☆ Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Black-box finetuning is an emerging interface for adapting state-of-the-art
language models to user needs. However, such access may also let malicious
actors undermine model safety. To demonstrate the challenge of defending
finetuning interfaces, we introduce covert malicious finetuning, a method to
compromise model safety via finetuning while evading detection. Our method
constructs a malicious dataset where every individual datapoint appears
innocuous, but finetuning on the dataset teaches the model to respond to
encoded harmful requests with encoded harmful responses. Applied to GPT-4, our
method produces a finetuned model that acts on harmful instructions 99% of the
time and avoids detection by defense mechanisms such as dataset inspection,
safety evaluations, and input/output classifiers. Our findings question whether
black-box finetuning access can be secured against sophisticated adversaries.
comment: 22 pages
☆ Understanding and Mitigating Language Confusion in LLMs
We investigate a surprising limitation of LLMs: their inability to
consistently generate text in a user's desired language. We create the Language
Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically
diverse languages with existing and newly-created English and multilingual
prompts. We evaluate a range of LLMs on monolingual and cross-lingual
generation reflecting practical use cases, finding that Llama Instruct and
Mistral models exhibit high degrees of language confusion and even the
strongest models fail to consistently respond in the correct language. We
observe that base and English-centric instruct models are more prone to
language confusion, which is aggravated by complex prompts and high sampling
temperatures. We find that language confusion can be partially mitigated via
few-shot prompting, multilingual SFT and preference tuning. We release our
language confusion benchmark, which serves as a first layer of efficient,
scalable multilingual evaluation at
https://github.com/for-ai/language-confusion.
☆ BioMNER: A Dataset for Biomedical Method Entity Recognition
Named entity recognition (NER) stands as a fundamental and pivotal task
within the realm of Natural Language Processing. Particularly within the domain
of Biomedical Method NER, this task presents notable challenges, stemming from
the continual influx of domain-specific terminologies in scholarly literature.
Current research in Biomedical Method (BioMethod) NER suffers from a scarcity
of resources, primarily attributed to the intricate nature of methodological
concepts, which necessitate a profound understanding for precise delineation.
In this study, we propose a novel dataset for biomedical method entity
recognition, employing an automated BioMethod entity recognition and
information retrieval system to assist human annotation. Furthermore, we
comprehensively explore a range of conventional and contemporary open-domain
NER methodologies, including the utilization of cutting-edge large-scale
language models (LLMs) customised to our dataset. Our empirical findings reveal
that the large parameter counts of language models surprisingly inhibit the
effective assimilation of entity extraction patterns pertaining to biomedical
methods. Remarkably, an approach leveraging the modestly sized ALBERT model
(only 11MB) in conjunction with conditional random fields (CRF) achieves
state-of-the-art (SOTA) performance.
☆ LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models
Large language models (LLMs) require continual knowledge updates to stay
abreast of the ever-changing world facts, prompting the formulation of lifelong
model editing task. While recent years have witnessed the development of
various techniques for single and batch editing, these methods either fail to
apply or perform sub-optimally when faced with lifelong editing. In this paper,
we introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong
model editing. We first analyze the factors influencing the effectiveness of
conventional MoE adaptor in lifelong editing, including catastrophic
forgetting, inconsistent routing and order sensitivity. Based on these
insights, we propose a tailored module insertion method to achieve lifelong
editing, incorporating a novel KV anchor routing to enhance routing consistency
between the training and inference stages, along with a concise yet effective
clustering-based editing order planning. Experimental results demonstrate the
effectiveness of our method in lifelong editing, surpassing previous model
editing techniques while maintaining outstanding performance on the batch
editing task. Our code will be available.
☆ ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana
Tool-augmented large language models (LLMs) are rapidly being integrated into
real-world applications. Due to the lack of benchmarks, the community has yet
to fully understand the hallucination issues within these models. To
address this challenge, we introduce a comprehensive diagnostic benchmark,
ToolBH. Specifically, we assess the LLM's hallucinations through two
perspectives: depth and breadth. In terms of depth, we propose a multi-level
diagnostic process, including (1) solvability detection, (2) solution planning,
and (3) missing-tool analysis. For breadth, we consider three scenarios based
on the characteristics of the toolset: missing necessary tools, potential
tools, and limited functionality tools. Furthermore, we developed seven tasks
and collected 700 evaluation samples through multiple rounds of manual
annotation. The results show the significant challenges presented by the ToolBH
benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o achieve total
scores of only 45.3 and 37.0, respectively, on a scale of 100. In this
benchmark, larger model parameters do not guarantee better performance; the
training data and response strategies also play a crucial role in tool-enhanced
LLM scenarios. Our diagnostic analysis indicates that the primary reason for
model errors lies in assessing task solvability. Additionally, open-weight
models suffer from performance drops with verbose replies, whereas proprietary
models excel with longer reasoning.
☆ The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
Following multiple instructions is a crucial ability for large language
models (LLMs). Evaluating this ability comes with significant challenges: (i)
limited coherence between multiple instructions, (ii) positional bias where the
order of instructions affects model performance, and (iii) a lack of
objectively verifiable tasks. To address these issues, we introduce a benchmark
designed to evaluate models' abilities to follow multiple instructions through
sequential instruction following (SIFo) tasks. In SIFo, the successful
completion of multiple instructions is verifiable by examining only the final
instruction. Our benchmark evaluates instruction following using four tasks
(text modification, question answering, mathematics, and security rule
following), each assessing different aspects of sequential instruction
following. Our evaluation of popular LLMs, both closed-source and open-source,
shows that more recent and larger models significantly outperform their older
and smaller counterparts on the SIFo tasks, validating the benchmark's
effectiveness. All models struggle with following sequences of instructions,
hinting at an important lack of robustness of today's language models.
☆ Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
This paper introduces a novel method of Progressive Low Rank Decomposition
(PLRD) tailored for the compression of large language models. Our approach
leverages a pre-trained model, which is then incrementally decomposed to
smaller sizes using progressively lower ranks. This method allows for
significant reductions in computational overhead and energy consumption, as
subsequent models are derived from the original without the need for retraining
from scratch. We detail the implementation of PLRD, which strategically
decreases the tensor ranks, thus optimizing the trade-off between model
performance and resource usage. The efficacy of PLRD is demonstrated through
extensive experiments showing that models trained with the PLRD method on only 1B
tokens maintain comparable performance with traditionally trained models while
using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability
to generate multiple model sizes from a single foundational model, adapting
fluidly to varying computational and memory budgets. Our findings suggest that
PLRD could set a new standard for the efficient scaling of LLMs, making
advanced AI more feasible on diverse platforms.
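A minimal sketch of the core operation, assuming plain truncated SVD on a single weight matrix and an illustrative rank schedule; the paper's method operates on full models with subsequent training.

```python
# Minimal sketch of progressive low-rank decomposition for one weight matrix:
# each step re-truncates the SVD to a lower rank, yielding a spectrum of
# smaller models from one parent.
import torch

def low_rank_factors(W: torch.Tensor, rank: int):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep only the top-`rank` singular directions: W ~ A @ B.
    A = U[:, :rank] * S[:rank]
    B = Vh[:rank, :]
    return A, B

W = torch.randn(4096, 4096)
for rank in (2048, 1024, 512):  # progressively lower ranks
    A, B = low_rank_factors(W, rank)
    err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
    params = A.numel() + B.numel()
    print(f"rank {rank}: {params / W.numel():.0%} of params, rel. err {err:.3f}")
```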
☆ Into the Unknown: Generating Geospatial Descriptions for New Environments
Similar to vision-and-language navigation (VLN) tasks that focus on bridging
the gap between vision and language for embodied navigation, the new Rendezvous
(RVS) task requires reasoning over allocentric spatial relationships
(independent of the observer's viewpoint) using non-sequential navigation
instructions and maps. However, performance substantially drops in new
environments with no training data. Using open-source descriptions paired with
coordinates (e.g., Wikipedia) provides training data but suffers from limited
spatially oriented text, resulting in low geolocation resolution. We propose a
large-scale augmentation method for generating high-quality synthetic data for
new environments using readily available geospatial data. Our method constructs
a grounded knowledge-graph, capturing entity relationships. Sampled entities
and relations (`shop north of school') generate navigation instructions via (i)
generating numerous templates using context-free grammar (CFG) to embed
specific entities and relations; (ii) feeding the entities and relations into a
large language model (LLM) for instruction generation. A comprehensive
evaluation on RVS showed that our approach improves the 100-meter accuracy by
45.83% on unseen environments. Furthermore, we demonstrate that models trained
with CFG-based augmentation achieve superior performance compared with those
trained with LLM-based augmentation, both in unseen and seen environments.
These findings suggest that explicitly structuring spatial information for
text-based geospatial reasoning in previously unknown environments can unlock
data-scarce scenarios.
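A minimal sketch of step (i), assuming a toy grammar: sampled knowledge-graph relations are expanded into instruction templates with NLTK's CFG utilities. The paper's template set is far larger.

```python
# Minimal sketch of CFG-based instruction templating. The grammar below is a
# toy stand-in for the paper's template inventory.
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
S -> 'Meet' 'me' 'at' 'the' PLACE REL ANCHOR
PLACE -> 'shop' | 'cafe'
REL -> 'north' 'of' | 'east' 'of'
ANCHOR -> 'the' 'school' | 'the' 'park'
""")

for sent in generate(grammar, n=5):
    print(" ".join(sent))  # e.g. "Meet me at the shop north of the school"
```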
☆ Simulating Financial Market via Large Language Model based Agents
Most economic theories typically assume that financial market participants
are fully rational individuals and use mathematical models to simulate human
behavior in financial markets. However, human behavior is often not entirely
rational and is challenging to predict accurately with mathematical models. In
this paper, we propose \textbf{A}gent-based \textbf{S}imulated
\textbf{F}inancial \textbf{M}arket (ASFM), which first constructs a simulated
stock market with a real order matching system. Then, we propose a large
language model based agent as the stock trader, which contains the profile,
observation, and tool-learning based action module. The trading agent can
comprehensively understand current market dynamics and financial policy
information, and make decisions that align with its trading strategy. In the
experiments, we first verify that the reactions of our ASFM are consistent with
the real stock market in two controllable scenarios. In addition, we also
conduct experiments in two popular economics research directions, and we find
that conclusions drawn in our ASFM align with the preliminary findings in
economics research. Based on these observations, we believe our proposed ASFM
provides a new paradigm for economic research.
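For intuition about the order matching system at ASFM's core, here is a minimal price-time-priority sketch; it is a simplified toy (e.g., all fills settle at the resting ask price), not ASFM's actual engine.

```python
# Minimal sketch of price-time-priority matching: best price first, earlier
# order wins ties. A real exchange simulator adds much more detail.
import heapq

class OrderBook:
    def __init__(self):
        self.bids, self.asks = [], []
        self.t = 0  # arrival counter for time priority

    def submit(self, side: str, price: float, qty: int):
        self.t += 1
        if side == "buy":
            heapq.heappush(self.bids, (-price, self.t, qty))  # max-heap on price
        else:
            heapq.heappush(self.asks, (price, self.t, qty))   # min-heap on price
        self._match()

    def _match(self):
        # Trade while the best bid crosses the best ask.
        while self.bids and self.asks and -self.bids[0][0] >= self.asks[0][0]:
            bid = heapq.heappop(self.bids)
            ask = heapq.heappop(self.asks)
            qty = min(bid[2], ask[2])
            print(f"trade {qty} @ {ask[0]}")  # simplified: fill at ask price
            if bid[2] > qty:
                heapq.heappush(self.bids, (bid[0], bid[1], bid[2] - qty))
            if ask[2] > qty:
                heapq.heappush(self.asks, (ask[0], ask[1], ask[2] - qty))

book = OrderBook()
book.submit("sell", 101.0, 10)
book.submit("buy", 101.5, 4)  # crosses the spread -> trade 4 @ 101.0
```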
☆ BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg
Incorporating speech understanding capabilities into pretrained
large-language models has become a vital research direction (SpeechLLM). The
previous architectures can be categorized as: i) GPT-style, prepend speech
prompts to the text prompts as a sequence of LLM inputs like a decoder-only
model; ii) T5-style, introduce speech cross-attention to each layer of the
pretrained LLMs. We propose BESTOW architecture to bring the BESt features from
TwO Worlds into a single model that is highly efficient and has strong
multitask capabilities. Moreover, there is no clear streaming solution for
either style, especially considering that the solution should generalize to
multiple speech tasks. We reformulate streamable SpeechLLM as a read-write
policy problem and unify offline and streaming research with the BESTOW
architecture. Hence
we demonstrate the first open-source SpeechLLM solution that enables Streaming
and Multitask at scale (beyond ASR) at the same time. This streamable solution
achieves very strong performance on a wide range of speech tasks (ASR, AST,
SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower
training/inference cost, and demonstrates LLM knowledge transferability to
speech.
☆ Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation
Damián Ariel Furman, Juan Junqueras, Z. Burçe Gümüslü, Edgar Altszyler, Joaquin Navajas, Ophelia Deroy, Justin Sulik
We present Reasons For and Against Vaccination (RFAV), a dataset for
predicting reasons for and against vaccination, and scientific authorities used
to justify them, annotated through nichesourcing and augmented using GPT4 and
GPT3.5-Turbo. We show how it is possible to mine these reasons in
unstructured text, under different task definitions, despite the high level
of subjectivity involved, and we explore the impact of artificially augmented
data using in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset
and the trained models along with the annotation manual used to train
annotators and define the task.
comment: 8 pages + references and appendix
☆ Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Generating rationales that justify scoring decisions has been a promising way
to facilitate explainability in automated scoring systems. However, existing
methods do not match the accuracy of classifier-based methods. Moreover, the
generated rationales often contain hallucinated information. To address these
issues, we propose a novel framework capable of generating more faithful
rationales and, more importantly, matching performance with classifier-based
black-box scoring systems. We first mimic the human assessment process by
querying Large Language Models (LLMs) to generate a thought tree. We then
summarise intermediate assessment decisions from each thought tree path for
creating synthetic rationale data and rationale preference data. Finally, we
utilise the generated synthetic data to calibrate LLMs through a two-step
training process: supervised fine-tuning and preference optimization. Extensive
experimental results demonstrate that our framework achieves a 38% assessment
performance improvement in the QWK score compared to prior work while producing
higher-quality rationales, as recognised by human evaluators and LLMs. Our work
sheds light on the effectiveness of performing preference optimization using
synthetic preference data obtained from thought tree paths.
☆ From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
We explore multi-step reasoning in vision-language models (VLMs). The problem
is challenging, as reasoning data consisting of multiple steps of visual and
language processing are barely available. To overcome the challenge, we first
introduce a least-to-most visual reasoning paradigm, which interleaves steps of
decomposing a question into sub-questions and invoking external tools for
resolving sub-questions. Based on the paradigm, we further propose a novel data
synthesis approach that can automatically create questions and multi-step
reasoning paths for an image in a bottom-up manner. Our approach divides the
complex synthesis task into a few simple sub-tasks, and (almost entirely)
relies on open-sourced models to accomplish the sub-tasks. Therefore, the
entire synthesis process is reproducible and cost-efficient, and the quality
of the synthesized data is guaranteed. With this approach, we construct $50$k
visual reasoning examples. Then, we develop a visual reasoner through
supervised fine-tuning, which is capable of generally enhancing the reasoning
abilities of a wide range of existing VLMs in a plug-and-play fashion.
Extensive experiments indicate that the visual reasoner can consistently and
significantly improve four VLMs on four VQA benchmarks. Our code and dataset
are available at https://github.com/steven-ccq/VisualReasoner.
☆ Interactive Topic Models with Optimal Transport
Topic models are widely used to analyze document collections. While they are
valuable for discovering latent topics in a corpus when analysts are unfamiliar
with the corpus, analysts also commonly start with an understanding of the
content present in a corpus. This may be through categories obtained from an
initial pass over the corpus or a desire to analyze the corpus through a
predefined set of categories derived from a high level theoretical framework
(e.g. political ideology). In these scenarios analysts desire a topic modeling
approach which incorporates their understanding of the corpus while supporting
various forms of interaction with the model. In this work, we present EdTM, an
approach for label-name-supervised topic modeling. EdTM casts topic modeling
as an assignment problem while leveraging LM/LLM-based document-topic
affinities and using optimal transport for making globally coherent
topic-assignments. In experiments, we show the efficacy of our framework
compared to few-shot LLM classifiers, and topic models based on clustering and
LDA. Further, we show EdTM's ability to incorporate various forms of analyst
feedback while remaining robust to noisy analyst inputs.
comment: Pre-print; Work in progress
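A minimal sketch of the assignment step, assuming a precomputed document-topic affinity matrix and the POT library's Sinkhorn solver; EdTM's full pipeline adds label-name supervision and interaction handling.

```python
# Minimal sketch of document-topic assignment as optimal transport. The
# affinity matrix here is a random stand-in for LM/LLM-derived affinities.
import numpy as np
import ot  # pip install pot

n_docs, n_topics = 100, 5
affinity = np.random.rand(n_docs, n_topics)
cost = affinity.max() - affinity                # higher affinity = lower cost

doc_mass = np.full(n_docs, 1.0 / n_docs)        # each document carries equal mass
topic_mass = np.full(n_topics, 1.0 / n_topics)  # topics receive balanced mass

plan = ot.sinkhorn(doc_mass, topic_mass, cost, reg=0.05)
assignments = plan.argmax(axis=1)               # globally coherent assignment
print(assignments[:10])
```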
☆ Paraphrase Types Elicit Prompt Engineering Capabilities
Much of the success of modern language models depends on finding a suitable
prompt to instruct the model. Until now, it has been largely unknown how
variations in the linguistic expression of prompts affect these models. This
study systematically and empirically evaluates which linguistic features
influence models through paraphrase types, i.e., different linguistic changes
at particular positions. We measure behavioral changes for five models across
120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon,
lexico-syntax, discourse, and others). We also control for other prompt
engineering factors (e.g., prompt length, lexical diversity, and proximity to
training data). Our results show a potential for language models to improve
tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7%
median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in
morphology and lexicon, i.e., the vocabulary used, showed promise in improving
prompts. These findings contribute to developing more robust language models
capable of handling variability in linguistic expression.
☆ Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
Erik Henriksson, Amanda Myntti, Anni Eskelinen, Selcen Erten-Johansson, Saara Hellström, Veronika Laippala
This article explores deep learning models for the automatic identification
of registers - text varieties such as news reports and discussion forums - in
web-based datasets across 16 languages. Web register (or genre) identification
would provide a robust solution for understanding the content of web-scale
datasets, which have become crucial in computational linguistics. Despite
recent advances, the potential of register classifiers on the noisy web remains
largely unexplored, particularly in multilingual settings and when targeting
the entire unrestricted web. We experiment with a range of deep learning models
using the new Multilingual CORE corpora, which include 16 languages annotated
using a detailed, hierarchical taxonomy of 25 registers designed to cover the
entire unrestricted web. Our models achieve state-of-the-art results, showing
that a detailed taxonomy in a hierarchical multi-label setting can yield
competitive classification performance. However, all models hit a glass ceiling
at approximately 80% F1 score, which we attribute to the non-discrete nature of
web registers and the inherent uncertainty in labeling some documents. By
pruning ambiguous examples, we improve model performance to over 90%. Finally,
multilingual models outperform monolingual ones, particularly benefiting
languages with fewer training examples and smaller registers. Although a
zero-shot setting decreases performance by an average of 7%, these drops are
not linked to specific registers or languages. Instead, registers show
surprising similarity across languages.
☆ Investigating the Timescales of Language Processing with EEG and Language Models
This study explores the temporal dynamics of language processing by examining
the alignment between word representations from a pre-trained transformer-based
language model and EEG data. Using a Temporal Response Function (TRF) model,
we investigate how neural activity corresponds to model representations across
different layers, revealing insights into the interaction between artificial
language models and brain responses during language comprehension. Our analysis
reveals patterns in TRFs from distinct layers, highlighting varying
contributions to lexical and compositional processing. Additionally, we used
linear discriminant analysis (LDA) to isolate part-of-speech (POS)
representations, offering insights into their influence on neural responses and
the underlying mechanisms of syntactic processing. These findings underscore
EEG's utility for probing language processing dynamics with high temporal
resolution. By bridging artificial language models and neural activity, this
study advances our understanding of their interaction at fine timescales.
comment: Accepted at the 2024 Conference on Cognitive Computational
Neuroscience (CCN 2024)
☆ Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
Human and model-generated texts can be distinguished by examining the
magnitude of likelihood in language. However, it is becoming increasingly
difficult as language models' capabilities of generating human-like texts keep
evolving. This study provides a new perspective by using the relative
likelihood values instead of absolute ones, and extracting useful features from
the spectrum-view of likelihood for the human-model text detection task. We
propose a detection procedure with two classification methods, supervised and
heuristic-based, which achieves performance competitive with previous
zero-shot detection methods and a new state of the art on short-text
detection. Our method can also reveal subtle differences between human and
model languages, which find theoretical roots in psycholinguistic studies. Our
code is available at https://github.com/CLCS-SUSTech/FourierGPT
comment: 13 pages, 12 figures
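A minimal sketch of the spectrum view, under the assumption that z-scored per-token log-likelihoods and their FFT magnitudes are the features of interest; the arrays below are random stand-ins for real model likelihoods.

```python
# Minimal sketch of a relative-likelihood spectrum for detection features.
import numpy as np

def likelihood_spectrum(token_logprobs: np.ndarray) -> np.ndarray:
    # Relative likelihood: z-score away the absolute magnitude first.
    z = (token_logprobs - token_logprobs.mean()) / (token_logprobs.std() + 1e-8)
    return np.abs(np.fft.rfft(z))  # magnitude spectrum as detection features

rng = np.random.default_rng(0)
human = rng.normal(-4.0, 1.5, size=128)  # stand-ins for real log-likelihoods
model = rng.normal(-2.0, 0.5, size=128)
print(likelihood_spectrum(human)[:5])
print(likelihood_spectrum(model)[:5])
```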
★ YuLan: An Open-source Large Language Model
Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ze-Feng Gao, Yueguo Chen, Weizheng Lu, Ji-Rong Wen
Large language models (LLMs) have become the foundation of many applications,
leveraging their extensive capabilities in processing and understanding natural
language. While many open-source LLMs have been released with technical
reports, the lack of training details hinders further research and development.
This paper presents the development of YuLan, a series of open-source LLMs with
$12$ billion parameters. The base model of YuLan is pre-trained on
approximately $1.7$T tokens derived from a diverse corpus, including massive
English, Chinese, and multilingual texts. We design a three-stage pre-training
method to enhance YuLan's overall capabilities. Subsequent phases of training
incorporate instruction-tuning and human alignment, employing a substantial
volume of high-quality synthesized data. To facilitate the learning of complex
and long-tail knowledge, we devise a curriculum-learning framework across
these stages, which helps LLMs learn knowledge in an easy-to-hard manner.
YuLan's training was finished in January 2024, and it has achieved performance
on par with state-of-the-art LLMs across various English and Chinese
benchmarks. This paper outlines a comprehensive technical roadmap for
developing LLMs from scratch. Our model and codes are available at
https://github.com/RUC-GSAI/YuLan-Chat.
☆ AnomaLLMy -- Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions
This paper introduces AnomaLLMy, a novel technique for the automatic
detection of anomalous tokens in black-box Large Language Models (LLMs) with
API-only access. Utilizing low-confidence single-token predictions as a
cost-effective indicator, AnomaLLMy identifies irregularities in model
behavior, addressing the issue of anomalous tokens degrading the quality and
reliability of models. Validated on the cl100k_base dataset, the token set of
GPT-4, AnomaLLMy detected 413 major and 65 minor anomalies, demonstrating the
method's efficiency with just \$24.39 spent in API credits. The insights from
this research are expected to be beneficial for enhancing the robustness and
accuracy of LLMs, particularly in the development and assessment of tokenizers.
comment: 6 pages
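A minimal sketch of the low-confidence probe; `top_token_probability`, the prompt wording, and the threshold are hypothetical stand-ins for however one queries a black-box API for top-1 next-token probabilities.

```python
# Minimal sketch: flag tokens the model cannot confidently reproduce.
def top_token_probability(prompt: str) -> float:
    # Hypothetical stand-in for a black-box API call returning the top-1
    # next-token probability (e.g. via returned logprobs).
    raise NotImplementedError("wrap your LLM API here")

def find_anomalous_tokens(tokens, threshold=0.9):
    anomalies = []
    for t in tokens:
        prompt = f'Repeat the string "{t}" exactly:'
        if top_token_probability(prompt) < threshold:
            anomalies.append(t)  # model is unsure about its own token
    return anomalies
```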
☆ BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering ACL 2024
Zheng Chu, Jingchang Chen, Qianglong Chen, Haotian Wang, Kun Zhu, Xiyuan Du, Weijiang Yu, Ming Liu, Bing Qin
Large language models (LLMs) have demonstrated strong reasoning capabilities.
Nevertheless, they still suffer from factual errors when tackling
knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising
approach. However, significant challenges still persist, including inaccurate
and insufficient retrieval for complex questions, as well as difficulty in
integrating multi-source knowledge. To address this, we propose Beam
Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive
multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop
of the question. Concretely, we parse complex questions into trees, which
include atomic and composite questions, followed by bottom-up reasoning. For
atomic questions, the LLM conducts reasoning on multi-source knowledge to get
answer candidates. For composite questions, the LLM combines beam candidates,
explores multiple reasoning paths through probabilistic aggregation, and
prioritizes the most promising trajectory. Extensive experiments on four
open-domain multi-hop reasoning datasets show that our method significantly
outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that
BeamAggR elicits better knowledge collaboration and answer aggregation.
comment: Accepted to ACL 2024
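A minimal sketch of what probabilistic aggregation over beam candidates might look like; the scoring rule here is illustrative, not the paper's exact formulation.

```python
# Minimal sketch: combine answer candidates across reasoning paths by
# accumulating joint probabilities of agreeing candidates.
from collections import defaultdict
from itertools import product

def aggregate_beams(beams: list[dict[str, float]], k: int = 2):
    # Each beam maps candidate answer -> probability for one reasoning path.
    scores = defaultdict(float)
    for combo in product(*[b.items() for b in beams]):
        joint = 1.0
        for _, p in combo:
            joint *= p
        answers = [a for a, _ in combo]
        best = max(set(answers), key=answers.count)  # credit the consensus answer
        scores[best] += joint
    return sorted(scores.items(), key=lambda x: -x[1])[:k]  # keep top-k

beams = [{"Paris": 0.7, "Lyon": 0.3}, {"Paris": 0.6, "Marseille": 0.4}]
print(aggregate_beams(beams))  # Paris dominates across reasoning paths
```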
☆ Scalable and Domain-General Abstractive Proposition Segmentation
Segmenting text into fine-grained units of meaning is important to a wide
range of NLP applications. The default approach of segmenting text into
sentences is often insufficient, especially since sentences are usually complex
enough to include multiple units of meaning that merit separate treatment in
the downstream task. We focus on the task of abstractive proposition
segmentation: transforming text into simple, self-contained, well-formed
sentences. Several recent works have demonstrated the utility of proposition
segmentation with few-shot prompted LLMs for downstream tasks such as
retrieval-augmented grounding and fact verification. However, this approach
does not scale to large amounts of text and may not always extract all the
facts from the input text. In this paper, we first introduce evaluation metrics
for the task to measure several dimensions of quality. We then propose a
scalable, yet accurate, proposition segmentation model. We model proposition
segmentation as a supervised task by training LLMs on existing annotated
datasets and show that training yields significantly improved results. We
further show that by using the fine-tuned LLMs as teachers for annotating large
amounts of multi-domain synthetic distillation data, we can train smaller
student models with results similar to the teacher LLMs. We then demonstrate
that our technique leads to effective domain generalization, by annotating data
in two domains outside the original training data and evaluating on them.
Finally, as a key contribution of the paper, we share an easy-to-use API for
NLP practitioners to use.
☆ NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations
Large language models (LLMs) achieve promising results in code generation
based on a given natural language description. They have been integrated into
open-source projects and commercial products to facilitate daily coding
activities. The natural language description in the prompt is crucial for LLMs
to comprehend users' requirements. Prior studies uncover that LLMs are
sensitive to the changes in the prompts, including slight changes that look
inconspicuous. However, the natural language descriptions often vary in
real-world scenarios (e.g., different formats, grammar, and wording). Prior
studies on the robustness of LLMs are often based on random perturbations and
such perturbations may not actually happen. In this paper, we conduct a
comprehensive study to investigate how robust code LLMs are to variations of
the natural language description in real-world scenarios. We summarize 18
categories of perturbations of natural language and 3 combinations of
co-occurred categories based on our literature review and an online survey with
practitioners. We propose an automated framework, NLPerturbator, which can
perform perturbations of each category given a set of prompts. Through a series
of experiments on code generation using six code LLMs, we find that the
perturbed prompts can decrease the performance of code generation by a
considerable margin (e.g., up to 21.2%, and 4.8% to 6.1% on average). Our study
highlights the importance of enhancing the robustness of LLMs to real-world
variations in the prompts, as well as the importance of constructing prompts
attentively.
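A minimal sketch of a single perturbation category ("character-level typo") applied to a prompt; the 18 categories in the paper are survey-derived, and this one is only an illustration.

```python
# Minimal sketch of one prompt perturbation category.
import random
import string

def typo_perturbation(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    # Randomly swap a small fraction of letters to emulate real typing slips.
    rng = random.Random(seed)
    out = [rng.choice(string.ascii_lowercase)
           if ch.isalpha() and rng.random() < rate else ch
           for ch in prompt]
    return "".join(out)

print(typo_perturbation("Write a function that parses ISO-8601 dates."))
```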
☆ Direct Preference Knowledge Distillation for Large Language Models
In the field of large language models (LLMs), Knowledge Distillation (KD) is
a critical technique for transferring capabilities from teacher models to
student models. However, existing KD methods face limitations and challenges in
distillation of LLMs, including efficiency and insufficient measurement
capabilities of traditional KL divergence. It is shown that LLMs can serve as
an implicit reward function, which we define as a supplement to KL divergence.
In this work, we propose Direct Preference Knowledge Distillation (DPKD) for
LLMs. DPKD utilizes distribution divergence to represent the preference loss
and implicit reward function. We re-formulate KD of LLMs into two stages: first
optimizing an objective consisting of implicit reward and reverse KL
divergence, and then improving the preference probability of teacher outputs
over student outputs. We conducted experiments and analysis on various datasets
with LLM parameters ranging from 120M to 13B and demonstrate the broad
applicability and effectiveness of our DPKD approach. Meanwhile, we prove the
value and effectiveness of the introduced implicit reward and output preference
in KD through experiments and theoretical analysis. The DPKD method outperforms
the baseline method in both output response precision and exact match
percentage. Code and data are available at https://aka.ms/dpkd.
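A minimal sketch of the preference stage, written as a DPO-style loss where the implicit reward is a beta-scaled log-ratio against a reference model; this is an illustrative reading of the objective, not the paper's full two-stage procedure.

```python
# Minimal sketch of a preference loss over teacher vs. student outputs.
import torch
import torch.nn.functional as F

def preference_loss(logp_teacher: torch.Tensor, logp_student: torch.Tensor,
                    ref_logp_teacher: torch.Tensor, ref_logp_student: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # Implicit reward: beta-scaled log-ratio of the trained policy against a
    # frozen reference, evaluated on teacher and student outputs.
    r_teacher = beta * (logp_teacher - ref_logp_teacher)
    r_student = beta * (logp_student - ref_logp_student)
    # Increase the preference probability of teacher outputs over student ones.
    return -F.logsigmoid(r_teacher - r_student).mean()

loss = preference_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                       torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```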
☆ Belief Revision: The Adaptability of Large Language Models Reasoning
The capability to reason from text is crucial for real-world NLP
applications. Real-world scenarios often involve incomplete or evolving data.
In response, individuals update their beliefs and understandings accordingly.
However, most existing evaluations assume that language models (LMs) operate
with consistent information. We introduce Belief-R, a new dataset designed to
test LMs' belief revision ability when presented with new evidence. Inspired by
how humans suppress prior inferences, this task assesses LMs within the newly
proposed delta reasoning ($\Delta R$) framework. Belief-R features sequences of
premises designed to simulate scenarios where additional information could
necessitate revising prior conclusions drawn by LMs. We evaluate $\sim$30 LMs
across diverse prompting strategies and find that LMs generally struggle to
appropriately revise their beliefs in response to new information. Further,
models adept at updating often underperformed in scenarios without necessary
updates, highlighting a critical trade-off. These insights underscore the
importance of improving LMs' adaptiveness to changing information, a step
toward more reliable AI systems.
☆ Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation
Legal case retrieval for sourcing similar cases is critical in upholding
judicial fairness. Different from general web search, legal case retrieval
involves processing lengthy, complex, and highly specialized legal documents.
Existing methods in this domain often overlook the incorporation of legal
expert knowledge, which is crucial for accurately understanding and modeling
legal cases, leading to unsatisfactory retrieval performance. This paper
introduces KELLER, a legal knowledge-guided case reformulation approach based
on large language models (LLMs) for effective and interpretable legal case
retrieval. By incorporating professional legal knowledge about crimes and law
articles, we enable large language models to accurately reformulate the
original legal case into concise sub-facts of crimes, which contain the
essential information of the case. Extensive experiments on two legal case
retrieval benchmarks demonstrate the superior retrieval performance and
robustness of KELLER over existing methods on complex legal case queries.
☆ Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Multilingual pre-trained models (mPLMs) have shown impressive performance on
cross-lingual transfer tasks. However, the transfer performance is often
hindered when a low-resource target language is written in a different script
than the high-resource source language, even though the two languages may be
related or share parts of their vocabularies. Inspired by recent work that uses
transliteration to address this problem, our paper proposes a
transliteration-based post-pretraining alignment (PPA) method aiming to improve
the cross-lingual alignment between languages using diverse scripts. We select
two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and
$\textbf{South+East Asian Languages}$, wherein the languages are mutually
influenced but use different scripts. We apply our method to these language
groups and conduct extensive experiments on a spectrum of downstream tasks. The
results show that after PPA, models consistently outperform the original model
(up to 50% for some tasks) in English-centric transfer. In addition, when we
use languages other than English as sources in transfer, our method obtains
even larger improvements. We will make our code and models publicly available
at \url{https://github.com/cisnlp/Transliteration-PPA}.
comment: preprint
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and
high-quality visual instruction data designed to enhance the
instruction-following capabilities of large multimodal models (LMMs). While
existing visual instruction datasets often focus on question-answering, they
struggle to generalize to broader application scenarios such as creative
writing, summarization, or image analysis. To address these limitations, we
propose a novel approach to constructing MM-Instruct that leverages the strong
instruction-following capabilities of existing LLMs to generate novel visual
instruction data from large-scale but conventional image captioning datasets.
MM-Instruct first leverages ChatGPT to automatically generate diverse
instructions from a small set of seed instructions through augmentation and
summarization. It then matches these instructions with images and uses an
open-sourced large language model (LLM) to generate coherent answers to the
instruction-image pairs. The LLM is grounded by the detailed text descriptions
of images in the whole answer generation process to guarantee the alignment of
the instruction data. Moreover, we introduce a benchmark based on the generated
instruction data to evaluate the instruction-following capabilities of existing
LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5
model on the generated data, denoted as LLaVA-Instruct, which exhibits
significant improvements in instruction-following capabilities compared to
LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models
are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at
https://github.com/jihaonew/MM-Instruct
☆ Message of the Third Kind: The Irruption of a Third Party into an Online Dialogue
Our study focuses on Wikipedia talk pages, from a global perspective
analyzing contributors' behaviors in online interactions. Using a corpus
comprising all Wikipedia talk pages in French, totaling more than 300,000
discussion threads, we examine how discussions with more than two participants
(multiparty conversation) unfold and we specifically investigate the role of a
third participant's intervention when two Wikipedians have already initiated an
exchange. In this regard, we concentrate on the sequential structure of these
interactions in terms of articulation among different participants and aim to
specify this third message by exploring its lexical particularities, while also
proposing an initial typology of the third participant's message role and how
it aligns with preceding messages.
comment: in French. JADT 2024 - 17es Journées internationales
d'Analyse statistique des Données Textuelles, SeSLa (Séminaire des
Sciences du Langage de l'UCLouvain -- Site Saint-Louis); LASLA (Laboratoire
d'Analyse statistique des Langues anciennes de l'Université de
Liège), 2024, Bruxelles, Belgique
☆ A Sense of Family: Analyzing Kinship Vocabulary with Word Embeddings
In this study, we propose a corpus analysis of an area of the French lexicon
that is both dense and highly structured: the vocabulary of family
relationships. Starting with a lexicon of 25 nouns designating the main
relationships (son, cousin, mother, grandfather, sister-in-law etc.), we
examine how these terms are positioned in relation to each other through
distributional analyses based on the use of these terms in corpora. We show
that distributional information can capture certain features that organize this
vocabulary (descent, alliance, siblings, gender), in ways that vary according to
the different corpora compared.
comment: in French. JADT 2024 - 17es Journées internationales
d'Analyse statistique des Données Textuelles, SeSLa (Séminaire des
Sciences du Langage de l'UCLouvain -- Site Saint-Louis), 2024, Bruxelles,
Belgique
☆ Uncertainty Quantification in Large Language Models Through Convex Hull Analysis
Uncertainty quantification approaches have become increasingly critical for
large language models (LLMs), particularly in high-risk applications requiring
reliable outputs. However, traditional methods for uncertainty quantification, such as
probabilistic models and ensemble techniques, face challenges when applied to
the complex and high-dimensional nature of LLM-generated outputs. This study
proposes a novel geometric approach to uncertainty quantification using convex
hull analysis. The proposed method leverages the spatial properties of response
embeddings to measure the dispersion and variability of model outputs. The
prompts are categorized into three types, i.e., `easy', `moderate', and
`confusing', to generate multiple responses using different LLMs at varying
temperature settings. The responses are transformed into high-dimensional
embeddings via a BERT model and subsequently projected into a two-dimensional
space using Principal Component Analysis (PCA). The Density-Based Spatial
Clustering of Applications with Noise (DBSCAN) algorithm is utilized to cluster
the embeddings and compute the convex hull for each selected cluster. The
experimental results indicate that the uncertainty of LLM outputs depends on
the prompt complexity, the model, and the temperature setting.
comment: 17 pages
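The geometric pipeline is concrete enough to sketch end to end: embeddings are projected with PCA, clustered with DBSCAN, and scored by convex hull area. The embeddings below are random stand-ins for BERT outputs, and the DBSCAN parameters are illustrative.

```python
# Minimal sketch of convex-hull-based uncertainty scoring over responses.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 768))     # stand-ins for BERT embeddings

points = PCA(n_components=2).fit_transform(embeddings)
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(points)

areas = []
for c in set(labels) - {-1}:                # -1 marks DBSCAN noise points
    cluster = points[labels == c]
    if len(cluster) >= 3:                   # a 2-D hull needs >= 3 points
        areas.append(ConvexHull(cluster).volume)  # .volume is area in 2-D
print("uncertainty score (total hull area):", sum(areas))
```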
☆ Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
Recent advances in speech recognition and translation rely on hundreds of
thousands of hours of Internet speech data. We argue that state-of-the-art
accuracy can be reached without relying on web-scale data. Canary, a
multilingual ASR and speech translation model, outperforms the current
state-of-the-art models Whisper, OWSM, and Seamless-M4T on English, French,
Spanish, and German, while being trained on an order of magnitude
less data than these models. Three key factors enable such a data-efficient
model: (1) a FastConformer-based attention encoder-decoder architecture, (2)
training on synthetic data generated with machine translation and (3) advanced
training techniques: data-balancing, dynamic data blending, dynamic bucketing
and noise-robust fine-tuning. The model, weights, and training code will be
open-sourced.
comment: Accepted at Interspeech-2024
☆ DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting
Coherence in writing, an aspect that second-language (L2) English learners
often struggle with, is crucial in assessing L2 English writing. Existing
automated writing evaluation systems primarily use basic surface linguistic
features to detect coherence in writing. However, little effort has been made
to correct the detected incoherence, which could significantly benefit L2
language learners seeking to improve their writing. To bridge this gap, we
introduce DECOR, a novel benchmark that includes expert annotations for
detecting incoherence in L2 English writing, identifying the underlying
reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the
first coherence assessment dataset specifically designed for improving L2
English writing, featuring pairs of original incoherent sentences alongside
their expert-rewritten counterparts. Additionally, we fine-tuned models to
automatically detect and rewrite incoherence in student essays. We find that
incorporating specific reasons for incoherence during fine-tuning consistently
improves the quality of the rewrites, achieving a result that is favored in
both automatic and human evaluations.
comment: 21 pages, 5 figures, 20 tables
☆ Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
The dynamics of human-AI communication have been reshaped by language models
such as ChatGPT. However, extant research has primarily focused on dyadic
communication, leaving much to be explored regarding the dynamics of human-AI
communication in group settings. The availability of multiple language model
chatbots presents a unique opportunity for scholars to better understand the
interaction between humans and multiple chatbots. This study examines the
impact of multi-chatbot communication in a specific persuasion setting:
promoting charitable donations. We developed an online environment that enables
multi-chatbot communication and conducted a pilot experiment utilizing two
GPT-based chatbots, Save the Children and UNICEF chatbots, to promote
charitable donations. In this study, we describe the development process of
the multi-chatbot interface and present preliminary findings from the pilot
experiment. Analyses of qualitative and quantitative feedback are presented,
and limitations are addressed.
☆ Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
Writing persuasive arguments is a challenging task for both humans and
machines. It entails incorporating high-level beliefs from various perspectives
on the topic, along with deliberate reasoning and planning to construct a
coherent narrative. Current language models often generate surface tokens
autoregressively, lacking explicit integration of these underlying controls,
resulting in limited output diversity and coherence. In this work, we propose a
persona-based multi-agent framework for argument writing. Inspired by human
debate, we first assign each agent a persona representing its high-level
beliefs from a unique perspective, and then design an agent interaction
process so that the agents can collaboratively debate and discuss the idea to
form an overall plan for argument writing. Such a debate process enables
fluid and
nonlinear development of ideas. We evaluate our framework on argumentative
essay writing. The results show that our framework can generate more diverse
and persuasive arguments through both automatic and human evaluations.
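The agent interaction process can be pictured as a round-robin debate loop.
A minimal sketch, where `generate` is a hypothetical stand-in for any chat
LLM call and the personas and round count are illustrative assumptions:

    def generate(system_prompt: str, history: str) -> str:
        raise NotImplementedError("plug in an LLM client here")

    personas = [
        "an economist focused on costs and incentives",
        "an ethicist focused on fairness and harm",
        "a skeptic who stress-tests every claim",
    ]

    def debate(topic: str, rounds: int = 2) -> str:
        transcript = f"Topic: {topic}\n"
        for _ in range(rounds):
            for p in personas:
                turn = generate(f"You argue as {p}.", transcript)
                transcript += f"[{p}] {turn}\n"
        # Final planning step: distill the debate into an outline for the essay.
        return generate("Synthesize the debate into an argument plan.", transcript)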
☆ IDT: Dual-Task Adversarial Attacks for Privacy Protection
Natural language processing (NLP) models may leak private information in
different ways, including membership inference, reconstruction or attribute
inference attacks. Sensitive information may not be explicit in the text, but
hidden in underlying writing characteristics. Privacy can be protected either
by using internal model representations demonstrated not to encode sensitive
attributes or, when users might not trust a model (the scenario of interest
here), by changing the raw text before models can access it. The goal is to
rewrite text to prevent someone
from inferring a sensitive attribute (e.g. the gender of the author, or their
location by the writing style) whilst keeping the text useful for its original
intention (e.g. the sentiment of a product review). The few works tackling this
have focused on generative techniques. However, these often create extensively
different texts from the original ones or face problems such as mode collapse.
This paper explores a novel adaptation of adversarial attack techniques to
manipulate a text to deceive a classifier w.r.t. one task (privacy) whilst
keeping the predictions of another classifier trained for another task
(utility) unchanged. We propose IDT, a method that analyses predictions made by
auxiliary and interpretable models to identify which tokens are important to
change for the privacy task, and which ones should be kept for the utility
task. We evaluate on different NLP datasets suitable for different tasks.
Automatic and human evaluations show that IDT retains the utility of text
while also outperforming existing methods at deceiving a classifier w.r.t.
the privacy task.
comment: 28 pages, 1 figure
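The token-importance analysis at the heart of IDT can be approximated with a
leave-one-out probe. A hedged sketch, assuming two classifier callables as
stand-ins for the auxiliary interpretable models; the paper's actual
attribution method may differ:

    def token_importance(tokens, privacy_prob, utility_prob):
        # privacy_prob / utility_prob map a token list to the probability of
        # the sensitive / utility label (assumed interfaces).
        base_p, base_u = privacy_prob(tokens), utility_prob(tokens)
        scores = []
        for i in range(len(tokens)):
            masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
            dp = base_p - privacy_prob(masked)       # drop in privacy confidence
            du = abs(base_u - utility_prob(masked))  # shift in utility confidence
            scores.append((tokens[i], dp - du))      # edit high-dp, low-du tokens
        return sorted(scores, key=lambda t: -t[1])

    # Dummy stand-in classifiers, for demonstration only.
    toks = "I loved the coffee shop near my office".split()
    demo_privacy = lambda ts: 0.9 - 0.2 * ("office" not in ts)
    demo_utility = lambda ts: 0.8 - 0.3 * ("loved" not in ts)
    print(token_importance(toks, demo_privacy, demo_utility)[:3])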
☆ Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
Many studies have revealed that large language models (LLMs) exhibit uneven
awareness of different contextual positions. Their limited context awareness
can lead to overlooking critical information and subsequent task failures.
While several approaches have been proposed to enhance LLMs' context
awareness, achieving both effectiveness and efficiency remains challenging.
In this paper,
for LLMs utilizing RoPE as position embeddings, we introduce a novel method
called ``Mixture of In-Context Experts'' (MoICE) to address this challenge.
MoICE comprises two key components: a router integrated into each attention
head within LLMs and a lightweight router-only training optimization strategy:
(1) MoICE views each RoPE angle as an `in-context' expert, demonstrated to be
capable of directing the attention of a head to specific contextual positions.
Consequently, each attention head flexibly processes tokens using multiple RoPE
angles dynamically selected by the router to attend to the needed positions.
This approach mitigates the risk of overlooking essential contextual
information. (2) The router-only training strategy entails freezing LLM
parameters and exclusively updating routers for only a few steps. When applied
to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods
across multiple tasks on long context understanding and generation, all while
maintaining commendable inference efficiency.
comment: 14 pages, 5 figures
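A deliberately simplified single-head PyTorch sketch of the mixture idea:
compute attention under several RoPE bases (the "experts") and mix the
outputs with router weights. The base values, the per-sequence rather than
per-token routing, and the module layout are all assumptions for
illustration, not the paper's implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rope(x, base):
        # Standard rotary position embedding on x of shape (seq, dim), dim even.
        seq, dim = x.shape
        half = dim // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        pos = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
        cos, sin = pos.cos(), pos.sin()
        x1, x2 = x[:, :half], x[:, half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    class MoICEHead(nn.Module):
        def __init__(self, dim, bases=(10_000.0, 50_000.0, 200_000.0)):
            super().__init__()
            self.bases = bases                         # one "expert" per RoPE base
            self.router = nn.Linear(dim, len(bases))   # the only trainable part

        def forward(self, q, k, v):
            # Route once per sequence for brevity (the paper routes dynamically).
            weights = F.softmax(self.router(q.mean(dim=0)), dim=-1)
            outs = []
            for base in self.bases:
                qr, kr = rope(q, base), rope(k, base)
                att = F.softmax(qr @ kr.T / q.shape[-1] ** 0.5, dim=-1)
                outs.append(att @ v)
            return sum(w * o for w, o in zip(weights, outs))

    head = MoICEHead(dim=64)
    out = head(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))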
☆ SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Synthetic data generation has gained significant attention recently for its
utility in training large vision and language models. However, the application
of synthetic data to the training of multimodal context-augmented generation
systems has been relatively unexplored. This gap in existing work is important
because existing vision and language models (VLMs) are not trained specifically
for context-augmented generation. Resources for adapting such models are
therefore crucial for enabling their use in retrieval-augmented generation
(RAG) settings, where a retriever is used to gather relevant information that
is then subsequently provided to a generative model via context augmentation.
To address this challenging problem, we generate SK-VQA: a large synthetic
multimodal dataset containing over 2 million question-answer pairs which
require external knowledge to determine the final answer. Our dataset is both
larger and significantly more diverse than existing resources of its kind,
possessing over 11x more unique questions and containing images from a greater
variety of sources than previously-proposed datasets. Through extensive
experiments, we demonstrate that our synthetic dataset can not only serve as a
challenging benchmark, but is also highly effective for adapting existing
generative multimodal models for context-augmented generation.
♻ ☆ AutoMix: Automatically Mixing Language Models
Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam
Large language models (LLMs) are now available from cloud API providers in
various sizes and configurations. While this diversity offers a broad spectrum
of choices, effectively leveraging the options to optimize computational cost
and performance remains challenging. In this work, we present AutoMix, an
approach that strategically routes queries to larger LMs based on the
approximate correctness of outputs from a smaller LM. Central to AutoMix are
two key technical contributions. First, it has a few-shot self-verification
mechanism, which estimates the reliability of its own outputs without requiring
extensive training. Second, given that self-verification can be noisy, it
employs a POMDP-based router that can effectively select an appropriately
sized model based on answer confidence. Experiments across five language
models and five challenging datasets show that AutoMix consistently surpasses
strong
baselines, reducing computational cost by over 50% for comparable performance.
comment: The first two authors contributed equally. Work started and partly
done during Aman's internship at Google. This version adds results on
additional models and datasets
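In its simplest form, the routing logic reduces to "verify the cheap answer,
escalate when unsure". A hedged sketch with a fixed threshold standing in for
the paper's POMDP router; `ask` is a hypothetical LLM-client placeholder and
the verification protocol is an assumption:

    def ask(model: str, prompt: str) -> str:
        raise NotImplementedError("plug in an LLM client here")

    def verify(model: str, question: str, answer: str) -> float:
        # Few-shot self-verification: have the small model grade its own answer
        # and map the verdict to a confidence in [0, 1] (assumed protocol).
        verdict = ask(model, f"Q: {question}\nA: {answer}\nIs this correct? yes/no")
        return 1.0 if verdict.strip().lower().startswith("yes") else 0.0

    def automix(question: str, threshold: float = 0.5) -> str:
        draft = ask("small-lm", question)
        if verify("small-lm", question, draft) >= threshold:
            return draft                  # cheap path: keep the small model's answer
        return ask("large-lm", question)  # escalate only when confidence is low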
♻ ☆ MBIAS: Mitigating Bias in Large Language Models While Retaining Context
The deployment of Large Language Models (LLMs) in diverse applications
necessitates an assurance of safety without compromising the contextual
integrity of the generated content. Traditional approaches, including
safety-specific fine-tuning or adversarial testing, often yield safe outputs at
the expense of contextual meaning. This can result in a diminished capacity to
handle nuanced aspects of bias and toxicity, such as underrepresentation or
negative portrayals across various demographics. To address these challenges,
we introduce MBIAS, an LLM framework carefully instruction fine-tuned on a
custom dataset designed specifically for safety interventions. MBIAS is
designed to significantly reduce biases and toxic elements in LLM outputs while
preserving the key information. This work also details our further use of
LLMs as an annotator under human supervision and as an evaluator of generated
content. Empirical analysis reveals that MBIAS achieves a reduction in bias and
toxicity by over 30\% in standard evaluations, and by more than 90\% in diverse
demographic tests, highlighting the robustness of our approach. We make the
dataset and the fine-tuned model available to the research community for
further investigation and ensure reproducibility. The code for this project can
be accessed here https://github.com/shainarazavi/MBIAS/tree/main.
Warning: This paper contains examples that may be offensive or upsetting.
♻ ☆ MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering
Large Language Models (LLMs), although powerful in general domains, often
perform poorly on domain-specific tasks like medical question answering (QA).
Moreover, they tend to function as "black-boxes," making it challenging to
modify their behavior. To address the problem, our study delves into retrieval
augmented generation (RAG), aiming to improve LLM responses without the need
for fine-tuning or retraining. Specifically, we propose a comprehensive
retrieval strategy to extract medical facts from an external knowledge base,
and then inject them into the query prompt for LLMs. Focusing on medical QA
using the MedQA-SMILE dataset, we evaluate the impact of different retrieval
models and the number of facts provided to the LLM. Notably, our
retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from
44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM
performance, offering a practical approach to mitigate the challenges of
black-box LLMs.
comment: Accepted by AMIA 2024 Annual Symposium
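The retrieve-then-inject strategy described above can be sketched directly: embed
the question, take the top-k nearest medical facts, and prepend them to the
prompt. The fact store and embeddings below are toy assumptions:

    import numpy as np

    facts = ["Metformin is first-line therapy for type 2 diabetes.",
             "ACE inhibitors can cause a dry cough."]
    fact_vecs = np.random.rand(len(facts), 64)   # stand-in for real embeddings

    def retrieve(question_vec, k=2):
        sims = fact_vecs @ question_vec
        return [facts[i] for i in np.argsort(-sims)[:k]]

    def build_prompt(question, question_vec):
        context = "\n".join(retrieve(question_vec))
        return f"Relevant medical facts:\n{context}\n\nQuestion: {question}\nAnswer:"

    print(build_prompt("What is first-line therapy for T2DM?", np.random.rand(64)))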
♻ ☆ LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have
been shown to easily reproduce parts of their training data, including
copyrighted work. This is an important problem to solve, as it may violate
existing copyright laws as well as the European AI Act. In this work, we
propose a systematic analysis to quantify the extent of potential copyright
infringements in LLMs using European law as an example. Unlike previous work,
we evaluate instruction-finetuned models in a realistic end-user scenario. Our
analysis builds on a threshold of 160 characters, borrowed from the German
Copyright Service Provider Act, combined with a fuzzy text matching algorithm
to identify potentially copyright-infringing textual reproductions. The
specificity of countermeasures against copyright infringement is analyzed by
comparing model behavior on copyrighted and public domain data. We investigate
what behaviors models show instead of producing protected text (such as refusal
or hallucination) and provide a first legal assessment of these behaviors. We
find that there are huge differences in copyright compliance, specificity, and
appropriate refusal among popular LLMs. Alpaca, GPT-4, GPT-3.5, and Luminous
perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing
a particularly low absolute number of potential copyright violations. Code will
be published soon.
comment: 10 pages, 3 figures
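The detection step pairs the 160-character threshold with text matching. A
minimal sketch using exact longest-common-block matching via Python's difflib
as a simplified stand-in for the paper's fuzzy matcher, which is not
specified here:

    from difflib import SequenceMatcher

    def flag_reproductions(model_output, source_text, threshold=160):
        # Report verbatim shared blocks of at least `threshold` characters.
        matcher = SequenceMatcher(None, model_output, source_text, autojunk=False)
        return [model_output[m.a:m.a + m.size]
                for m in matcher.get_matching_blocks()
                if m.size >= threshold]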
♻ ☆ A Small and Fast BERT for Chinese Medical Punctuation Restoration INTERSPEECH 2024
In clinical dictation, utterances after automatic speech recognition (ASR)
without explicit punctuation marks may lead to the misunderstanding of dictated
reports. To give a precise and understandable clinical report with ASR,
automatic punctuation restoration is required. Considering a practical
scenario, we propose a fast and light pre-trained model for Chinese medical
punctuation restoration based on the 'pretraining and fine-tuning' paradigm.
In this work, we distill pre-trained models by incorporating supervised
contrastive learning and a novel auxiliary pre-training task (Punctuation
Mark Prediction) to make them well-suited for punctuation restoration. Our
experiments on various distilled models reveal that our model can achieve 95%
of the performance at 10% of the model size relative to state-of-the-art
Chinese RoBERTa.
comment: 5 pages, 2 figures, Accepted by INTERSPEECH 2024
♻ ☆ Distributed Speculative Inference of Large Language Models
Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
Accelerating the inference of large language models (LLMs) is an important
challenge in artificial intelligence. This paper introduces distributed
speculative inference (DSI), a novel distributed inference algorithm that is
provably faster than speculative inference (SI) [leviathan2023fast,
chen2023accelerating, miao2023specinfer] and traditional autoregressive
inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs,
requiring no training or architectural modifications, and it preserves the
target distribution.
Prior studies on SI have demonstrated empirical speedups (compared to non-SI)
but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs
often do not have matching drafters that are sufficiently fast and accurate. We
show a gap: SI gets slower than non-SI when using slower or less accurate
drafters. We close this gap by proving that DSI is faster than both SI and
non-SI given any drafters. By orchestrating multiple instances of the target
and drafters, DSI is not only faster than SI but also supports LLMs that cannot
be accelerated with SI.
Our simulations show speedups of off-the-shelf LLMs in realistic settings:
DSI is 1.29-1.92x faster than SI.
♻ ☆ How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction EMNLP 2023
Recently, ChatGPT has attracted a lot of interest from both researchers and
the general public. While the performance of ChatGPT in named entity
recognition and relation extraction from Standard English texts is
satisfactory, it remains to be seen if it can perform similarly for Malaysian
English. Malaysian English is unique as it exhibits morphosyntactic and
semantic adaptation from local contexts. In this study, we assess ChatGPT's
capability in extracting entities and relations from the Malaysian English News
(MEN) dataset. We propose a three-step methodology referred to as
\textbf{\textit{educate-predict-evaluate}}. The performance of ChatGPT is
assessed using F1-Score across 18 unique prompt settings, which were carefully
engineered for a comprehensive review. From our evaluation, we found that
ChatGPT does not perform well in extracting entities from Malaysian English
news articles, with the highest F1-Score of 0.497. Further analysis shows that
the morphosyntactic adaptation in Malaysian English causes this limitation.
Interestingly, however, this adaptation does not impact ChatGPT's performance
on relation extraction.
comment: Accepted in Generation, Evaluation & Metrics (GEM) Workshop at EMNLP
2023
♻ ☆ Are LLM-based Evaluators Confusing NLG Quality Criteria? ACL 2024
Some prior work has shown that LLMs perform well in NLG evaluation for
different tasks. However, we discover that LLMs seem to confuse different
evaluation criteria, which reduces their reliability. For further verification,
we first address the inconsistent conceptualization and vague expression in
existing NLG quality criteria themselves by summarizing a clear hierarchical
classification system for 11 common aspects, with corresponding criteria
drawn from previous studies. Inspired by behavioral
testing, we elaborately design 18 types of aspect-targeted perturbation attacks
for fine-grained analysis of the evaluation behaviors of different LLMs. We
also conduct human annotations beyond the guidance of the classification system
to validate the impact of the perturbations. Our experimental results reveal
confusion issues inherent in LLMs, as well as other noteworthy phenomena, and
necessitate further research and improvements for LLM-based evaluation.
comment: Accepted by ACL 2024
♻ ☆ NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes
We introduce NoteChat, a novel cooperative multi-agent framework leveraging
Large Language Models (LLMs) to generate patient-physician dialogues. NoteChat
embodies the principle that an ensemble of role-specific LLMs, through
structured role-play and strategic prompting, can perform their assigned roles
more effectively. The synergy among these role-playing LLMs results in a
cohesive and efficient dialogue generation. Evaluation on MTS-dialogue, a
benchmark dataset of patient-physician dialogue-note pairs, shows that models
trained on the synthetic patient-physician dialogues augmented by NoteChat
outperform other state-of-the-art models for generating clinical notes. Our
comprehensive automatic and human evaluation demonstrates that NoteChat
surpasses state-of-the-art models such as ChatGPT and GPT-4 by up to 22.78%,
as judged by domain experts, in generating superior synthetic
patient-physician dialogues based on clinical notes. NoteChat has the
potential to engage
patients directly and help clinical documentation, a leading cause of physician
burnout.
♻ ☆ JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability
Large Language Models (LLMs) have demonstrated a remarkable potential in
medical knowledge acquisition and question-answering. However, LLMs can
potentially hallucinate and yield factually incorrect outcomes, even with
domain-specific pretraining. Previously, retrieval-augmented generation (RAG)
has had limited success in addressing hallucinations. Unlike previous RAG
methods, where the retrieval model is trained separately from the LLM, we
introduce JMLR (for Jointly training the LLM and information Retrieval)
during the fine-tuning phase. The synchronized training mechanism enhances
JMLR's ability to retrieve
clinical guidelines and leverage medical knowledge to reason and answer
questions and reduces the demand for computational resources. We evaluated JMLR
on the important medical question-answering application. Our experimental
results demonstrate that JMLR-13B (70.5%) outperforms a previous
state-of-the-art open-source model using conventional pre-training and
fine-tuning Meditron-70B (68.9%) and Llama2-13B with RAG (67.7%) on a medical
question-answering dataset. Comprehensive evaluations reveal JMLR-13B enhances
reasoning quality and reduces hallucinations better than Claude3-Opus.
Additionally, JMLR-13B (148 GPU hours) also trains much faster than
Meditron-70B (42630 GPU hours). Through this work, we provide a new and
efficient knowledge enhancement method for healthcare, demonstrating the
potential of integrating retrieval and LLM training for medical
question-answering systems.
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various
generation tasks by leveraging latent variables to learn data distributions and
generate high-quality samples. Despite the field of explainable AI making
strides in interpreting machine learning models, understanding latent variables
in generative models remains challenging. This paper introduces
LatentExplainer, a framework for automatically generating semantically
meaningful explanations of latent variables in deep generative models.
LatentExplainer tackles three main challenges: inferring the meaning of latent
variables, aligning explanations with inductive biases, and handling varying
degrees of explainability. By perturbing latent variables and interpreting
changes in generated data, the framework provides a systematic approach to
understanding and controlling the data generation process, enhancing the
transparency and interpretability of deep generative models. We evaluate our
proposed method on several real-world and synthetic datasets, and the results
demonstrate superior performance in generating high-quality explanations of
latent variables.
♻ ☆ A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation ECML-PKDD
Yongkang Liu, Ercong Nie, Shi Feng, Zheng Hua, Zifeng Ding, Daling Wang, Yifei Zhang, Hinrich Schütze
Current state-of-the-art dialogue systems heavily rely on extensive training
datasets. However, challenges arise in domains where domain-specific training
datasets are insufficient or entirely absent. To tackle this challenge, we
propose a novel data \textbf{A}ugmentation framework for
\textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred
to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation
process and a two-stage training approach: domain-agnostic training and domain
adaptation training. We posit that domain corpora are a blend of
domain-agnostic and domain-specific features, with certain representation
patterns shared among diverse domains. Domain-agnostic training aims to enable
models to learn these common expressive patterns. To construct domain-agnostic
dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing
technique used to remove domain-specific features. By mitigating the effects of
domain-specific features, the model trained on the de-domained corpora can
effectively learn common expression patterns in different domains.
Subsequently, we adapt the learned domain-agnostic features to the target
domain through domain adaptation training. We conduct experiments on Chinese
dialogue datasets from five different domains and show that AMD$^2$G achieves
superior performance compared to both direct training on the target domain
corpus and collective training on all five domain corpora. Our work underscores
AMD$^2$G as a viable alternative solution for low-resource multi-domain
dialogue generation. Code and data associated with our work are available in
a GitHub repository.
comment: 17 pages, ECML-PKDD
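The de-domaining step can be pictured as lexicon-driven placeholder
substitution. A minimal sketch, assuming a hand-built domain lexicon; the
paper's actual de-domaining technique may be more sophisticated:

    import re

    DOMAIN_LEXICON = {
        "film": ["director", "box office", "sequel"],
        "medical": ["diagnosis", "prescription", "symptom"],
    }

    def de_domain(utterance: str, domain: str) -> str:
        # Replace domain-specific mentions with a generic placeholder so only
        # domain-agnostic expression patterns remain.
        for term in DOMAIN_LEXICON[domain]:
            utterance = re.sub(rf"\b{re.escape(term)}\b", "[DOMAIN_TERM]",
                               utterance, flags=re.IGNORECASE)
        return utterance

    print(de_domain("The director praised the sequel's box office.", "film"))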
♻ ☆ Do prompt positions really matter?
Prompt-based models have garnered considerable attention from researchers due to
their remarkable advancements in the fields of zero-shot and few-shot learning.
Developing an effective prompt template plays a critical role. However, prior
studies have mainly focused on prompt vocabulary searching or embedding
initialization within a predefined template with the prompt position fixed. In
this empirical study, we conduct the most comprehensive analysis to date of
prompt position for diverse Natural Language Processing (NLP) tasks. Our
findings quantify the substantial impact prompt position has on model
performance. We observe that the prompt positions used in prior studies are
often sub-optimal, and this observation is consistent even in widely used
instruction-tuned models. These findings suggest prompt position optimisation
as a valuable research direction to augment prompt engineering methodologies
and prompt position-aware instruction tuning as a potential way to build more
robust models in the future.
comment: 8 pages, 2 figures
♻ ☆ Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning
Accurate recognition of aviation commands is vital for flight safety and
efficiency, as pilots must follow air traffic control instructions precisely.
This paper addresses challenges in speech command recognition, such as noisy
environments and limited computational resources, by advancing keyword spotting
technology. We create a dataset of standardized airport tower commands,
including routine and emergency instructions. We enhance broadcasted residual
learning with squeeze-and-excitation and time-frame frequency-wise
squeeze-and-excitation techniques, resulting in our BC-SENet model. This model
focuses on crucial information with fewer parameters. Our tests on five keyword
spotting models, including BC-SENet, demonstrate superior accuracy and
efficiency. These findings highlight the effectiveness of our model
advancements in improving speech command recognition for aviation safety and
efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows
comparable performance on the common Google Speech Command dataset.
comment: Accepted by IALP 2024
♻ ☆ RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova, Vladislav Mikhailov
Minimal pairs are a well-established approach to evaluating the grammatical
knowledge of language models. However, existing resources for minimal pairs
address a limited number of languages and lack diversity of language-specific
grammatical phenomena. This paper introduces the Russian Benchmark of
Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that
differ in grammaticality and isolate a morphological, syntactic, or semantic
phenomenon. In contrast to existing benchmarks of linguistic minimal pairs,
RuBLiMP is created by applying linguistic perturbations to automatically
annotated sentences from open text corpora and carefully curating test data. We
describe the data collection protocol and present the results of evaluating 25
language models in various scenarios. We find that the widely used language
models for Russian are sensitive to morphological and agreement-oriented
contrasts but fall behind humans on phenomena requiring understanding of
structural relations, negation, transitivity, and tense. RuBLiMP, the codebase,
and other materials are publicly available.
♻ ☆ TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models ACL 2024
Grasping the concept of time is a fundamental facet of human cognition,
indispensable for truly comprehending the intricacies of the world. Previous
studies typically focus on specific aspects of time, lacking a comprehensive
temporal reasoning benchmark. To address this, we propose TimeBench, a
comprehensive hierarchical temporal reasoning benchmark that covers a broad
spectrum of temporal reasoning phenomena. TimeBench provides a thorough
evaluation for investigating the temporal reasoning capabilities of large
language models. We conduct extensive experiments on GPT-4, LLaMA2, and other
popular LLMs under various settings. Our experimental results indicate a
significant performance gap between the state-of-the-art LLMs and humans,
highlighting that there is still a considerable distance to cover in temporal
reasoning. Moreover, LLMs exhibit capability discrepancies across different
reasoning categories. Furthermore, we thoroughly analyze the impact of multiple
aspects on temporal reasoning and emphasize the associated challenges. We
aspire for TimeBench to serve as a comprehensive benchmark, fostering research
in temporal reasoning. Resources are available at:
https://github.com/zchuz/TimeBench
comment: Accepted to ACL 2024
♻ ☆ A synthetic data approach for domain generalization of NLI models
Natural Language Inference (NLI) remains an important benchmark task for
LLMs. NLI datasets are a springboard for transfer learning to other semantic
tasks, and NLI models are standard tools for identifying the faithfulness of
model-generated text. There are several large scale NLI datasets today, and
models have improved greatly by hill-climbing on these collections. Yet their
realistic performance on out-of-distribution/domain data is less
well-understood. We explore the opportunity for synthetic high-quality datasets
to adapt NLI models for zero-shot use in downstream applications across new and
unseen text domains. We demonstrate a new approach for generating NLI data in
diverse domains and lengths, so far not covered by existing training sets. The
resulting examples have meaningful premises, the hypotheses are formed in
creative ways rather than simple edits to a few premise tokens, and the labels
have high accuracy. We show that models trained on this data ($685$K synthetic
examples) have the best generalization to completely new downstream test
settings. On the TRUE benchmark, a T5-small model trained with our data
improves around $7\%$ on average compared to training on the best alternative
dataset. The improvements are more pronounced for smaller models, while still
meaningful on a T5 XXL model. We also demonstrate gains on test sets when
in-domain training data is augmented with our domain-general synthetic data.
♻ ☆ Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues LREC
During task-oriented dialogues (TODs), human users naturally introduce
chitchat that is beyond the immediate scope of the task, interfering with the
flow of the conversation. To address this issue without the need for expensive
manual data creation, we use few-shot prompting with Llama-2-70B to enhance the
MultiWOZ dataset with user backstories, a typical example of chitchat
interference in TODs. We assess the impact of this addition by testing two
models: one trained solely on TODs and another trained on TODs with a
preliminary chitchat interaction. Our analysis demonstrates that our enhanced
dataset poses a challenge for these systems. Moreover, we demonstrate that our
dataset can be effectively used for training purposes, enabling a system to
consistently acknowledge the user's backstory while also successfully moving
the task forward in the same turn, as confirmed by human evaluation. These
findings highlight the benefits of generating novel chitchat-TOD scenarios to
test TOD systems more thoroughly and improve their resilience to natural user
interference.
comment: Accepted @ LREC-COLING 2024
♻ ☆ MathChat: Converse to Tackle Challenging Math Problems with LLM Agents
Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang
Employing Large Language Models (LLMs) to address mathematical problems is an
intriguing research endeavor, considering the abundance of math problems
expressed in natural language across numerous science and engineering fields.
LLMs, with their generalized ability, are used as a foundation model to build
AI agents for different tasks. In this paper, we study the effectiveness of
utilizing LLM agents to solve math problems through conversations. We propose
MathChat, a conversational problem-solving framework designed for math
problems. MathChat consists of an LLM agent and a user proxy agent which is
responsible for tool execution and additional guidance. This synergy
facilitates a collaborative problem-solving process, where the agents engage in
a dialogue to solve the problems. We perform evaluation on difficult high
school competition problems from the MATH dataset. Utilizing Python, we show
that MathChat can further improve previous tool-using prompting methods by 6%.
comment: Update version
♻ ☆ A Unified Approach to Emotion Detection and Task-Oriented Dialogue Modeling
In current text-based task-oriented dialogue (TOD) systems, user emotion
detection (ED) is often overlooked or is typically treated as a separate and
independent task, requiring additional training. In contrast, our work
demonstrates that seamlessly unifying ED and TOD modeling brings about mutual
benefits, and is therefore an alternative to be considered. Our method
consists of augmenting SimpleToD, an end-to-end TOD system, by extending
belief state
tracking to include ED, relying on a single language model. We evaluate our
approach using GPT-2 and Llama-2 on the EmoWOZ benchmark, a version of MultiWOZ
annotated with emotions. Our results reveal a general increase in performance
for ED and task results. Our findings also indicate that user emotions provide
useful contextual conditioning for system responses, and can be leveraged to
further refine responses in terms of empathy.
comment: Accepted @ IWSDS 2024
♻ ☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models
(LLMs) to follow instructions. While many effective IFT datasets have been
introduced recently, they predominantly focus on high-resource languages like
English. To better align LLMs across a broad spectrum of languages and tasks,
we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual,
Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed
by first selecting a diverse set of seed examples and then utilizing the
proposed Evol taxonomy to convert these seeds into complex and challenging
multi-turn instructions. We demonstrate the effectiveness of M2Lingual by
training LLMs of varying sizes and showcasing the enhanced performance across a
diverse set of languages. We contribute the two-step Evol taxonomy with the
guided generation code (https://github.com/ServiceNow/M2Lingual), as well as
the first fully synthetic, general and task-oriented, multi-turn,
multilingual dataset built with Evol, M2Lingual
(https://huggingface.co/datasets/ServiceNow-AI/M2Lingual), containing 182K
total IFT pairs and covering 70 languages and 17+ NLP tasks.
comment: 39 pages
♻ ☆ BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we present a new embedding model, called M3-Embedding, which
is distinguished for its versatility in Multi-Linguality, Multi-Functionality,
and Multi-Granularity. It can support more than 100 working languages, leading
to new state-of-the-art performances on multi-lingual and cross-lingual
retrieval tasks. It can simultaneously perform the three common retrieval
functionalities of embedding models: dense retrieval, multi-vector retrieval,
and sparse retrieval, which provides a unified model foundation for real-world
IR applications. It is able to process inputs of different granularities,
spanning from short sentences to long documents of up to 8192 tokens. The
effective training of M3-Embedding involves the following technical
contributions. We propose a novel self-knowledge distillation approach, where
the relevance scores from different retrieval functionalities can be integrated
as the teacher signal to enhance the training quality. We also optimize the
batching strategy, enabling a large batch size and high training throughput to
ensure the discriminativeness of embeddings. To the best of our knowledge,
M3-Embedding is the first embedding model which realizes such a strong
versatility. The model and code will be publicly available at
https://github.com/FlagOpen/FlagEmbedding.
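The three retrieval functionalities can be fused at scoring time. A hedged
NumPy sketch of one plausible combination; the fusion weights and the
query/document interfaces are assumptions, not the released model's API:

    import numpy as np

    def dense_score(q_vec, d_vec):
        # Single-vector (CLS) similarity.
        return float(q_vec @ d_vec)

    def sparse_score(q_weights, d_weights):
        # Lexical overlap: sum of products of learned token weights.
        return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

    def multi_vector_score(q_tokens, d_tokens):
        # ColBERT-style late interaction: sum of per-query-token max similarities.
        return float(np.max(q_tokens @ d_tokens.T, axis=1).sum())

    def hybrid_score(query, doc, w=(0.4, 0.2, 0.4)):
        return (w[0] * dense_score(query["dense"], doc["dense"])
                + w[1] * sparse_score(query["sparse"], doc["sparse"])
                + w[2] * multi_vector_score(query["multi"], doc["multi"]))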
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume
of information from around the world. Automated event detection within this
vast data repository is essential for monitoring, identifying, and categorizing
significant news occurrences across diverse platforms. This paper presents an
event detection framework that leverages Large Language Models (LLMs) combined
with clustering analysis to detect news events from the Global Database of
Events, Language, and Tone (GDELT). The framework enhances event clustering
through both pre-event detection tasks (keyword extraction and text embedding)
and post-event detection tasks (event summarization and topic labelling). We
also evaluate the impact of various textual embeddings on the quality of
clustering outcomes, ensuring robust news categorization. Additionally, we
introduce a novel Cluster Stability Assessment Index (CSAI) to assess the
validity and robustness of clustering results. CSAI utilizes multiple feature
vectors to provide a new way of measuring clustering quality. Our experiments
indicate that the use of LLM embeddings in the event detection framework
significantly improves the results, demonstrating greater robustness in terms
of CSAI scores. Moreover, post-event detection tasks generate meaningful
insights, facilitating effective interpretation of event clustering results.
Overall, our experimental results indicate that the proposed framework offers
valuable insights and could enhance the accuracy in news analysis and
reporting.
♻ ☆ SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang
Large language models (LLMs) now support extremely long context windows, but
the quadratic complexity of vanilla attention results in significantly long
Time-to-First-Token (TTFT) latency. Existing approaches to address this
complexity require additional pretraining or finetuning, and often sacrifice
model accuracy. In this paper, we first provide both theoretical and empirical
foundations for near-lossless sparse attention. We find dynamically capturing
head-specific sparse patterns at runtime with low overhead is crucial. To
address this, we propose SampleAttention, an adaptive structured and
near-lossless sparse attention. Leveraging observed significant sparse
patterns, SampleAttention attends to a fixed percentage of adjacent tokens to
capture local window patterns, and employs a two-stage query-guided key-value
filtering approach, which adaptively selects a minimum set of key-values with
low overhead, to capture column stripe patterns. Comprehensive evaluations show
that SampleAttention can seamlessly replace vanilla attention in off-the-shelf
LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$
compared with FlashAttention.
♻ ☆ SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models ICML 2024
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
Most of the existing Large Language Model (LLM) benchmarks on scientific
problem reasoning focus on problems grounded in high-school subjects and are
confined to elementary algebraic operations. To systematically examine the
reasoning capabilities required for solving complex scientific problems, we
introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a
carefully curated dataset featuring a range of collegiate-level scientific
problems from mathematics, chemistry, and physics domains. Based on the
dataset, we conduct an in-depth benchmarking study of representative
open-source and proprietary LLMs with various prompting strategies. The results
reveal that the current LLMs fall short of delivering satisfactory performance,
with the best overall score of merely 43.22%. Furthermore, through a detailed
user study, we categorize the errors made by LLMs into ten problem-solving
abilities. Our analysis indicates that no single prompting strategy
significantly outperforms the others and some strategies that demonstrate
improvements in certain problem-solving skills could result in declines in
other skills. We envision that SciBench will catalyze further developments in
the reasoning abilities of LLMs, thereby ultimately contributing to scientific
research and discovery.
comment: To appear at ICML 2024
♻ ☆ Active Preference Learning for Large Language Models
As large language models (LLMs) become more capable, fine-tuning techniques
for aligning with human intent are increasingly important. A key consideration
for aligning these models is how to most effectively use human resources, or
model resources in the case where LLMs themselves are used as oracles.
Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most
prominent example of such a technique, but is complex and often unstable.
Direct Preference Optimization (DPO) has recently been proposed as a simpler
and more stable alternative. In this work, we develop an active learning
strategy for DPO to make better use of preference labels. We propose a
practical acquisition function for prompt/completion pairs based on the
predictive entropy of the language model and a measure of certainty of the
implicit preference model optimized by DPO. We demonstrate how our approach
improves both the rate of learning and final performance of fine-tuning on
pairwise preference data.
comment: 13 pages, 5 figures, 6 tables
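DPO's implicit preference model assigns each completion the reward
beta * (log pi(y|x) - log pi_ref(y|x)); the margin between two completions
then measures how settled the preference already is. A hedged sketch of one
way to combine this certainty with predictive entropy; the paper's exact
acquisition function may differ:

    def implicit_preference_certainty(logp_y1, logp_ref_y1,
                                      logp_y2, logp_ref_y2, beta=0.1):
        # A large absolute reward margin means the preference is already settled.
        margin = beta * ((logp_y1 - logp_ref_y1) - (logp_y2 - logp_ref_y2))
        return abs(margin)

    def acquisition_score(predictive_entropy, certainty):
        # Prefer prompts the policy is uncertain about (high entropy) and whose
        # pairwise preference the implicit model has not yet resolved.
        return predictive_entropy - certainty

    score = acquisition_score(
        predictive_entropy=2.3,
        certainty=implicit_preference_certainty(-12.0, -13.0, -11.5, -11.6))
    print(score)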
♻ ☆ UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Siyuan Wu, Yue Huang, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, Lichao Sun
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly
impacted various fields by enabling high-quality synthetic data generation and
reducing dependence on expensive human-generated datasets. Despite this,
challenges remain in the areas of generalization, controllability, diversity,
and truthfulness within the existing generative frameworks. To address these
challenges, this paper presents UniGen, a comprehensive LLM-powered framework
designed to produce diverse, accurate, and highly controllable datasets. UniGen
is adaptable, supporting all types of text datasets and enhancing the
generative process through innovative mechanisms. To augment data diversity,
UniGen incorporates an attribute-guided generation module and a group checking
feature. For accuracy, it employs a code-based mathematical assessment for
label verification alongside a retrieval-augmented generation technique for
factual validation. The framework also allows for user-specified constraints,
enabling customization of the data generation process to suit particular
requirements. Extensive experiments demonstrate the superior quality of data
generated by UniGen, and each module within UniGen plays a critical role in
this enhancement. Additionally, UniGen is applied in two practical scenarios:
benchmarking LLMs and data augmentation. The results indicate that UniGen
effectively supports dynamic and evolving benchmarking, and that data
augmentation improves LLM capabilities in various domains, including
agent-oriented abilities and reasoning skills.
♻ ☆ Concept-aware Data Construction Improves In-context Learning of Language Models ACL 2024
Many recent language models (LMs) are capable of in-context learning (ICL),
manifested in the LMs' ability to perform a new task solely from
natural-language instruction. Previous work curating in-context learners
assumes that ICL emerges from a vast over-parametrization or the scale of
multi-task training. However, recent theoretical work attributes the ICL
ability to concept-dependent training data and creates functional in-context
learners even in small-scale, synthetic settings.
In this work, we practically explore this newly identified axis of ICL
quality. We propose Concept-aware Training (CoAT), a framework for constructing
training scenarios that make it beneficial for the LM to learn to utilize the
analogical reasoning concepts from demonstrations. We find that by using CoAT,
pre-trained transformers can learn to better utilise new latent concepts from
demonstrations and that such ability makes ICL more robust to the functional
deficiencies of the previous models. Finally, we show that concept-aware
in-context learning is more effective for a majority of new tasks when compared
to traditional instruction tuning, resulting in performance comparable to
previous in-context learners trained on orders of magnitude more training
data.
comment: Long paper to appear in Findings of ACL 2024
♻ ☆ Latent Logic Tree Extraction for Event Sequence Explanation from LLMs
Modern high-stakes systems, such as healthcare or robotics, often generate
vast streaming event sequences. Our goal is to design an efficient,
plug-and-play tool to elicit logic tree-based explanations from Large Language
Models (LLMs) to provide customized insights into each observed event sequence.
Built on the temporal point process model for events, our method employs the
likelihood function as a score to evaluate generated logic trees. We propose an
amortized Expectation-Maximization (EM) learning framework and treat the logic
tree as latent variables. In the E-step, we evaluate the posterior distribution
over the latent logic trees using an LLM prior and the likelihood of the
observed event sequences. The LLM provides a high-quality prior for the
latent logic trees; however, since the posterior is built over a discrete
combinatorial space, no closed-form solution is available. We propose to
generate logic tree samples from the posterior using a learnable GFlowNet,
which is a diversity-seeking generator for structured discrete variables. The
M-step employs the generated logic rules to approximate marginalization over
the posterior, facilitating the learning of model parameters and refining the
tunable LLM prior parameters. In the online setting, our locally built,
lightweight model will iteratively extract the most relevant rules from LLMs
for each sequence using only a few iterations. Empirical demonstrations
showcase the promising performance and adaptability of our framework.
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader
applications of large vision-language models (LVLMs). Object hallucination
refers to the phenomenon that the LVLMs claim non-existent objects in the
image. To mitigate the object hallucinations, instruction tuning and external
model-based detection methods have been proposed, which either require
large-scale computational resources or depend on the detection results of
external models. However, utilizing the LVLM itself to alleviate object
hallucinations remains under-explored. In this work, we adopt the
intuition that the LVLM tends to respond logically consistently for existent
objects but inconsistently for hallucinated objects. Therefore, we propose a
Logical Closed Loop-based framework for Object Hallucination Detection and
Mitigation, namely LogicCheckGPT. Specifically, we devise logical consistency
probing to raise questions with logical correlations, inquiring about
attributes from objects and vice versa. Whether their responses can form a
logical closed loop serves as an indicator of object hallucination. As a
plug-and-play method, it can be seamlessly applied to all existing LVLMs.
Comprehensive experiments conducted on three benchmarks across four LVLMs have
demonstrated significant improvements brought by our method, indicating its
effectiveness and generality.
comment: Accept to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
Traditionally, discriminative models have been the predominant choice for
tasks like document classification and information extraction. These models
make predictions that fall into a limited number of predefined classes,
facilitating a binary true or false evaluation and enabling the direct
calculation of metrics such as the F1 score. However, recent advancements in
generative large language models (GLLMs) have prompted a shift in the field due
to their enhanced zero-shot capabilities, which eliminate the need for a
downstream dataset and computationally expensive fine-tuning. However,
evaluating GLLMs presents a challenge as the binary true or false evaluation
used for discriminative models is not applicable to the predictions made by
GLLMs.
This paper introduces a new metric for generative models called ANLS* for
evaluating a wide variety of tasks, including information extraction and
classification tasks. The ANLS* metric extends existing ANLS metrics as a
drop-in replacement and is still compatible with previously reported ANLS
scores. An evaluation across 7 different datasets, more than 10 different
GLLMs, and 3 different prompting methods using the ANLS* metric is also
provided, demonstrating the importance of the proposed metric.
We also benchmark a novel approach to generate prompts for documents, called
SFT, against other prompting techniques such as LATIN. In 6 out of 7 cases, SFT
outperforms other techniques and improves the state-of-the-art, sometimes by as
much as $10$ percentage points.
Sources are available at https://github.com/deepopinion/anls_star_metric
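For reference, the classic ANLS score that ANLS* extends is one minus the
normalized Levenshtein distance, zeroed below a similarity threshold. A
minimal sketch of that base metric only; the ANLS* generalizations to lists,
dictionaries, and other structured outputs are not reproduced here:

    def levenshtein(a: str, b: str) -> int:
        # Standard dynamic-programming edit distance, one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def anls(pred: str, gold: str, tau: float = 0.5) -> float:
        nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
        return 1.0 - nl if nl < tau else 0.0

    print(anls("hawking", "hawkins"))   # 1 - 1/7, roughly 0.857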
♻ ☆ Does Geo-co-location Matter? A Case Study of Public Health Conversations during COVID-19
Social media platforms like Twitter (now X) have been pivotal in information
dissemination and public engagement, especially during COVID-19. A key goal for
public health experts was to encourage prosocial behavior that could impact
local outcomes such as masking and social distancing. Given the importance of
local news and guidance during COVID-19, the objective of our research is to
analyze the effect of localized engagement, on social media conversations. This
study examines the impact of geographic co-location, as a proxy for localized
engagement between public health experts (PHEs) and the public, on social
media. We analyze a Twitter conversation dataset from January 2020 to November
2021, comprising over 19K tweets from nearly five hundred PHEs, along with
approximately 800K replies from 350K participants. Our findings reveal that
geo-co-location is associated with higher engagement rates, especially in
conversations on topics including masking, lockdowns, and education, and in
conversations with academic and medical professionals. Lexical features
associated with emotion and personal experiences were more common in
geo-co-located contexts. This research provides insights into how geographic
co-location influences social media engagement and can inform strategies to
improve public health messaging.
♻ ☆ Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People
Xidong Wang, Nuo Chen, Junyin Chen, Yan Hu, Yidong Wang, Xiangbo Wu, Anningzhe Gao, Xiang Wan, Haizhou Li, Benyou Wang
Despite the vast repository of global medical knowledge predominantly being
in English, local languages are crucial for delivering tailored healthcare
services, particularly in areas with limited medical resources. To extend the
reach of medical AI advancements to a broader population, we aim to develop
medical LLMs across the six most widely spoken languages, encompassing a global
population of 6.1 billion. This effort culminates in the creation of the
ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the
multilingual medical benchmark, the released Apollo models, at various
relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best
performance among models of equivalent size. Notably, Apollo-7B is the
state-of-the-art multilingual medical LLM among models up to 70B.
Additionally, these lite
models could be used to improve the multi-lingual medical capabilities of
larger models without fine-tuning in a proxy-tuning fashion. We will
open-source training corpora, code, model weights and evaluation benchmark.
comment: Preprint
♻ ☆ SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang, Wanxu Zhao, Rui Zheng, Huijie Lv, Shihan Dou, Sixian Li, Xiao Wang, Enyu Zhou, Junjie Ye, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang
As the development of large language models (LLMs) rapidly advances, securing
these models effectively without compromising their utility has become a
pivotal area of research. However, current defense strategies against jailbreak
attacks (i.e., efforts to bypass security protocols) often suffer from limited
adaptability, restricted general capability, and high cost. To address these
challenges, we introduce SafeAligner, a methodology implemented at the decoding
stage to fortify defenses against jailbreak attacks. We begin by developing two
specialized models: the Sentinel Model, which is trained to foster safety, and
the Intruder Model, designed to generate riskier responses. SafeAligner
leverages the disparity in security levels between the responses from these
models to differentiate between harmful and beneficial tokens, effectively
guiding the safety alignment by altering the output token distribution of the
target model. Extensive experiments show that SafeAligner can increase the
likelihood of beneficial tokens, while reducing the occurrence of harmful ones,
thereby ensuring secure alignment with minimal loss to generality.
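At the decoding stage, the disparity guidance can be pictured as a logit
offset. A hedged PyTorch sketch; alpha and the exact combination rule are
assumptions, not the paper's precise formulation:

    import torch
    import torch.nn.functional as F

    def safealigner_step(target_logits, sentinel_logits, intruder_logits,
                         alpha=1.0):
        # Tokens the safety-trained Sentinel favors over the risk-seeking
        # Intruder are treated as safer and boosted in the target distribution.
        disparity = sentinel_logits - intruder_logits
        return F.softmax(target_logits + alpha * disparity, dim=-1)

    vocab_size = 32000
    probs = safealigner_step(torch.randn(vocab_size),
                             torch.randn(vocab_size),
                             torch.randn(vocab_size))
    next_token = int(torch.argmax(probs))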
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
Existing benchmarks for visual question answering lack in visual grounding
and complexity, particularly in evaluating spatial reasoning skills. We
introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of
visual question-answering multimodal language models in reasoning with
flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and
human-verified flowchart images from three distinct content sources, along with
22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks,
including information localization, decision-making, and logical progression.
We conduct a thorough baseline evaluation on a suite of both open-source and
proprietary multimodal language models using various strategies, followed by an
analysis of directional bias. The results underscore the benchmark's potential
as a vital tool for advancing the field of multimodal modeling, providing a
focused and challenging environment for enhancing model performance in visual
and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
Language Models (LMs) are being proposed for mental health applications where
the heightened risk of adverse outcomes means predictive performance may not be
a sufficient litmus test of a model's utility in clinical practice. A model
that can be trusted for practice should have a correspondence between
explanation and clinical determination, yet no prior research has examined the
attention fidelity of these models and their effect on ground truth
explanations. We introduce an evaluation design that focuses on the robustness
and explainability of LMs in identifying Wellness Dimensions (WD). We focus on
two mental health and well-being datasets: (a) Multi-label Classification-based
MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against
expert-labeled explanations. The labels are based on Halbert Dunn's theory of
wellness, which gives grounding to our evaluation. We reveal four surprising
results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4
lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM, fails to deliver any
remarkable improvement in performance or explanations. (2) Re-examining LMs'
predictions based on a confidence-oriented loss function reveals a significant
performance drop. (3) Across all LMs/LLMs, the alignment between attention and
explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental
health-specific LMs/LLMs overlook domain-specific knowledge and undervalue
explanations, causing these discrepancies. This study highlights the need for
further research into their consistency and explanations in mental health and
well-being.
comment: 26 pages, including reference and appendix sections, 8 figures, and
16 tables
♻ ☆ AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
Artificial intelligence has significantly advanced healthcare, particularly
through large language models (LLMs) that excel in medical question answering
benchmarks. However, their real-world clinical application remains limited due
to the complexities of doctor-patient interactions. To address this, we
introduce \textbf{AI Hospital}, a multi-agent framework simulating dynamic
medical interactions between \emph{Doctor} as player and NPCs including
\emph{Patient}, \emph{Examiner}, \emph{Chief Physician}. This setup allows for
realistic assessments of LLMs in clinical scenarios. We develop the Multi-View
Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical
records and NPCs to evaluate LLMs' performance in symptom collection,
examination recommendations, and diagnoses. Additionally, a dispute resolution
collaborative mechanism is proposed to enhance diagnostic accuracy through
iterative discussions. Despite improvements, current LLMs exhibit significant
performance gaps in multi-turn interactions compared to one-step approaches.
Our findings highlight the need for further research to bridge these gaps and
improve LLMs' clinical diagnostic capabilities. Our data, code, and
experimental results are all open-sourced at
\url{https://github.com/LibertFan/AI_Hospital}.
comment: https://github.com/LibertFan/AI_Hospital
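A minimal sketch of the kind of role-based consultation loop such a simulator implies is shown below; the role names follow the abstract, but the chat() interface and turn structure are assumptions for illustration rather than the repository's actual implementation.

    # A role-based consultation loop; chat(role, transcript) stands in for a
    # per-role LLM call and is an illustrative assumption.
    def consultation(chat, max_turns=3):
        transcript = []
        for _ in range(max_turns):
            question = chat("Doctor", transcript)   # doctor asks about symptoms
            transcript.append(("Doctor", question))
            reply = chat("Patient", transcript)     # patient (NPC) responds
            transcript.append(("Patient", reply))
        prompt = transcript + [("System", "Give the final diagnosis.")]
        return chat("Doctor", prompt), transcript

    # Toy stub standing in for the per-role LLM calls.
    def fake_chat(role, transcript):
        if transcript and transcript[-1][0] == "System":
            return "Influenza, pending lab confirmation."
        return {"Doctor": "How long have you had the fever?",
                "Patient": "About three days."}[role]

    diagnosis, _ = consultation(fake_chat, max_turns=1)
    print(diagnosis)  # "Influenza, pending lab confirmation."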
♻ ☆ Navigating LLM Ethics: Advancements, Challenges, and Future Directions
This study addresses ethical issues surrounding Large Language Models (LLMs)
within the field of artificial intelligence. It explores the common ethical
challenges posed by both LLMs and other AI systems, such as privacy and
fairness, as well as ethical challenges uniquely arising from LLMs. It
highlights challenges such as hallucination, verifiable accountability, and
decoding censorship complexity, which are unique to LLMs and distinct from
those encountered in traditional AI systems. The study underscores the need to
tackle these complexities to ensure accountability, reduce biases, and enhance
transparency in the influential role that LLMs play in shaping information
dissemination. It proposes mitigation strategies and future directions for LLM
ethics, advocating for interdisciplinary collaboration. It recommends ethical
frameworks tailored to specific domains and dynamic auditing systems adapted to
diverse contexts. This roadmap aims to guide responsible development and
integration of LLMs, envisioning a future where ethical considerations govern
AI advancements in society.
♻ ☆ The global landscape of academic guidelines for generative AI and Large Language Models
The integration of Generative Artificial Intelligence (GAI) and Large
Language Models (LLMs) in academia has spurred a global discourse on their
potential pedagogical benefits and ethical considerations. Positive reactions
highlight some potential, such as collaborative creativity, increased access to
education, and empowerment of trainers and trainees. However, negative
reactions raise concerns about ethical complexities, balancing innovation and
academic integrity, unequal access, and misinformation risks. Through a
systematic survey and text-mining-based analysis of global and national
directives, insights from independent research, and eighty university-level
guidelines, this study provides a nuanced understanding of the opportunities
and challenges posed by GAI and LLMs in education. It emphasizes the importance
of balanced approaches that harness the benefits of these technologies while
addressing ethical considerations and ensuring equitable access and educational
outcomes. The paper concludes with recommendations for fostering responsible
innovation and ethical practices to guide the integration of GAI and LLMs in
academia.
♻ ☆ Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges
Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, Shafiq Joty
In the rapidly evolving field of large language models (LLMs), data
augmentation (DA) has emerged as a pivotal technique for enhancing model
performance by diversifying training examples without the need for additional
data collection. This survey explores the transformative impact of LLMs on DA,
particularly addressing the unique challenges and opportunities they present in
the context of natural language processing (NLP) and beyond. From both data and
learning perspectives, we examine various strategies that utilize LLMs for data
augmentation, including a novel exploration of learning paradigms where
LLM-generated data is used for diverse forms of further training. Additionally,
this paper highlights the primary open challenges faced in this domain, ranging
from controllable data augmentation to multi-modal data augmentation. This
survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve
as a comprehensive guide for researchers and practitioners.
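As one concrete instance of the strategies this survey covers, the sketch below expands a labeled dataset with LLM-generated paraphrases; the prompt template and the generate() interface are illustrative assumptions, not a recipe taken from the paper.

    # Label-preserving paraphrase augmentation; generate(prompt) stands in for
    # any LLM call and is an illustrative assumption.
    def paraphrase_augment(examples, generate, n_variants=2):
        augmented = list(examples)
        for text, label in examples:
            for i in range(n_variants):
                prompt = ("Paraphrase the following sentence, preserving its "
                          f"meaning (variant {i + 1}): {text}")
                augmented.append((generate(prompt), label))
        return augmented

    # Toy usage with a stub generator.
    data = [("The movie was fantastic.", "positive")]
    print(paraphrase_augment(data, lambda p: "I loved the film.", n_variants=1))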
♻ ☆ MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations ICLR 2024
Hanlei Zhang, Xin Wang, Hua Xu, Qianrui Zhou, Kai Gao, Jianhua Su, Jinyue Zhao, Wenrui Li, Yanting Chen
Multimodal intent recognition poses significant challenges, requiring the
incorporation of non-verbal modalities from real-world contexts to enhance the
comprehension of human intentions. Existing benchmark datasets are limited in
scale and suffer from difficulties in handling out-of-scope samples that arise
in multi-turn conversational interactions. We introduce MIntRec2.0, a
large-scale benchmark dataset for multimodal intent recognition in multi-party
conversations. It contains 1,245 dialogues with 15,040 samples, each annotated
within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope
samples, it also includes 5,736 out-of-scope samples appearing in multi-turn
contexts, which naturally occur in real-world scenarios. Furthermore, we
provide comprehensive information on the speakers in each utterance, enriching
its utility for multi-party conversational research. We establish a general
framework supporting the organization of single-turn and multi-turn dialogue
data, modality feature extraction, multimodal fusion, as well as in-scope
classification and out-of-scope detection. Evaluation benchmarks are built
using classic multimodal fusion methods, ChatGPT, and human evaluators. While
existing methods incorporating nonverbal information yield improvements,
effectively leveraging context information and detecting out-of-scope samples
remains a substantial challenge. Notably, large language models exhibit a
significant performance gap compared to humans, highlighting the limitations of
machine learning methods in the cognitive intent understanding task. We believe
that MIntRec2.0 will serve as a valuable resource, providing a pioneering
foundation for research in human-machine conversational interactions, and
significantly facilitating related applications. The full dataset and codes are
available at https://github.com/thuiar/MIntRec2.0.
comment: Accepted by ICLR 2024, Long Paper; The abstract is slightly modified
due to the length limitation
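For readers unfamiliar with out-of-scope detection, the sketch below shows the standard maximum-softmax-probability baseline that benchmarks of this kind can evaluate; the threshold rule is a generic recipe, not MIntRec2.0's own method.

    # Threshold-on-confidence out-of-scope detection; a generic baseline.
    import math

    def classify_with_oos(logits, labels, threshold=0.5):
        exps = [math.exp(z - max(logits)) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]                  # softmax
        best = max(range(len(probs)), key=lambda i: probs[i])
        return labels[best] if probs[best] >= threshold else "out-of-scope"

    # Toy usage: a confident prediction versus an uncertain one.
    labels = ["complain", "praise", "inform"]
    print(classify_with_oos([4.0, 0.5, 0.2], labels))      # "complain"
    print(classify_with_oos([0.4, 0.5, 0.45], labels))     # "out-of-scope"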
♻ ☆ Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process COLING 2024
Pre-trained language models (PLMs) leverage chain-of-thought (CoT) prompting to
simulate human reasoning and inference processes, achieving proficient
performance in multi-hop QA. However, a gap persists between PLMs' reasoning
abilities and those of humans when tackling complex problems. Psychological
studies suggest a vital connection between explicit information in passages and
human prior knowledge during reading. Nevertheless, current research has given
insufficient attention to linking input passages and PLMs' pre-training-based
knowledge from the perspective of human cognition studies. In this study, we
introduce a Prompting Explicit and Implicit knowledge (PEI) framework, which
uses prompts to connect explicit and implicit knowledge, aligning with the human
reading process for multi-hop QA. We consider the input passages as explicit
knowledge, employing them to elicit implicit knowledge through unified prompt
reasoning. Furthermore, our model incorporates type-specific reasoning via
prompts, a form of implicit knowledge. Experimental results show that PEI
performs comparably to the state-of-the-art on HotpotQA. Ablation studies
confirm the efficacy of our model in bridging and integrating explicit and
implicit knowledge.
comment: This paper has been accepted at COLING 2024
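A loose sketch of the two-stage prompting idea follows: first elicit implicit (background) knowledge conditioned on the passages, then answer using both; the prompt wording below is a guess at the general recipe, not the paper's actual templates.

    # Two-stage prompting: elicit implicit knowledge, then answer with both.
    # generate(prompt) stands in for an LLM call; the prompts are illustrative.
    def pei_style_answer(passages, question, generate):
        explicit = "\n".join(passages)
        implicit = generate(
            f"Passages:\n{explicit}\n\n"
            f"State the background facts needed to answer: {question}"
        )
        return generate(
            f"Passages:\n{explicit}\n\nBackground knowledge:\n{implicit}\n\n"
            f"Question: {question}\nAnswer:"
        )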